In recent years, deep learning has infiltrated every 
field it has touched, reducing the need for specialist 
knowledge and automating the process of knowledge 
discovery from data. This review argues that astronomy 
is no different, and that we are currently in the midst 
of a deep learning revolution that is transforming 
the way we do astronomy. We trace the history 
of astronomical connectionism from the early days 
of multilayer perceptrons, through the second wave 
of convolutional and recurrent neural networks, 
to the current third wave of self-supervised and 
unsupervised deep learning. We then predict that 
we will soon enter a fourth wave of astronomical 
connectionism, in which finetuned versions of an 
all-encompassing ‘foundation’ model will replace 
expertly crafted deep learning models. We argue that 
such a model can only be brought about through 
a symbiotic relationship between astronomy and 
connectionism, whereby astronomy provides high 
quality multimodal data to train the foundation 
model, and in turn the foundation model is used to 
advance astronomical research. 


1. Introduction 


The concept of artificial intelligence (AI) can be traced 
back at least 350 years to Leibniz’s Dissertation on 
the Art of Combinations [1]. Inspired by Descartes and 
Llull, Leibniz posited that, through the development of 
a ‘universal language,’ all ideas could be represented 
by the combination of a small set of fundamental 
concepts, and that new concepts could be generated in a 
logical fashion, potentially by some computing machine. 
Leibniz’s ambitious vision (‘let us calculate’) has not yet 
been realised, but the quest to emulate human reasoning, 
or at least to build a machine to mimic the computational 
and data processing capabilities of the human brain, has 
persisted to this day. 

It might be fair to say that the roots of AI stretch even as far back as Llull’s medieval philosophy 
that inspired Leibniz [2,3]. However, if we now consider AI to be a bona fide scientific discipline, 
then that discipline clearly emerged in the post-war years of the twentieth century, following 
Turing’s simple enquiry ‘can machines think?’ [4]. Somewhat philosophical in nature, Turing’s 1950 
question succinctly articulates the ambition of AI, but from a nuts and bolts standpoint it took a 
further five years from Turing’s query for what one might call the first AI program — the so-called 
‘Logic Theorist’ — to be developed by Allen Newell, Cliff Shaw, and Herbert Simon. Funded by 
the Research and Development (RAND) Corporation, the Logic Theorist was designed, in part, to 
emulate the role of a human mathematician, in that it could automate the proof of mathematical 
theorems. This was a breakthrough in computer science and the Logic Theorist was presented at 
the seminal Dartmouth Summer Research Project on Artificial Intelligence (DSRPAI) conference 
in 1956, now regarded as the true birth of AI as a field. Indeed, it was DSRPAI organiser John 
McCarthy who is credited with coining the term ‘artificial intelligence’ [5]. 

Natural intrigue — and clearly a good deal of fear — of the idea of AI has inspired popular 
culture no end, from Dick’s Do Androids Dream Of Electric Sheep? to Crichton’s Westworld, 
Terminator’s ‘Skynet’ and beyond. Iain M. Banks’s Galactic civilisation known as ‘The Culture’ 
imagines a society run by powerful ‘Minds’ whose intelligence and wisdom far exceeds that of 
humans, and where biological beings and machines of equivalent sentience generally co-exist 
peacefully, cooperatively, and equitably. Science fiction notwithstanding, if these dreams are even 
possible, we are still many years away from a machine that can genuinely think for itself [6,7]. 
Nevertheless, the question of how one mathematically (and algorithmically) models the workings 
and inter-relationships of biological neurons — neural networks — and the subsequent exploration 
of how they can find utility as tools in the data analyst’s workshop is really what is being referred 
to when most people use the term ‘AI’ today¹. While we must always be wary of hype and 
buzzwordism, it is the application of neural networks — and the possibility of tackling hitherto 
intractable problems — that offers genuine reason for excitement across many disparate fields of 
enquiry, including astronomy. 

Astronomers have made use of artificial neural networks (ANNs) for over three decades. In 
1994, Ofer Lahav, an early trailblazer, wryly identified the ‘neuro-skeptics’ — those resistant to 
the use of such techniques in serious astrophysics research — and argued that ANNs ‘should be 
viewed as a general statistical framework, rather than as an esoteric approach’ [8]. Unfortunately, this 
skepticism has persisted, despite the recent upsurge in the use of neural networks (and machine 
learning in general) in the field, as illustrated in Fig. 1. Most of the criticism of machine learning 
techniques, and deep learning² in particular, is levelled at the perceived ‘black box’ nature of the 
methodology. In this review we provide a primer on how deep neural networks are constructed, 
and the mathematical rules governing their learning, which we hope will serve as a useful 
resource for neuro-skeptics. Nevertheless, we must recognise that a unified theoretical picture 
of how deep neural networks work does not yet exist. This remains a point of debate even within 
the deep learning community. For example, Yann LeCun, responding to Ali Rahimi’s ‘Test of Time’ 
award talk at the 31st Conference on Neural Information Processing Systems (NIPS), remarked: 


Ali gave an entertaining and well-delivered talk. But I fundamentally disagree with the message. 
The main message was, in essence, that the current practice in machine learning is akin to ‘alchemy’ 
(his word). It’s insulting, yes. But never mind that: It’s wrong! Ali complained about the lack of 
(theoretical) understanding of many methods that are currently used in ML, particularly in deep 
learning ... Sticking to a set of methods just because you can do theory about it, while ignoring a set 
of methods that empirically work better just because you don’t (yet) understand them theoretically 
is akin to looking for your lost car keys under the street light knowing you lost them someplace else. 
Yes, we need better understanding of our methods. But the correct attitude is to attempt to fix the 
situation, not to insult a whole community for not having succeeded in fixing it yet. This is like 
criticizing James Watt for not being Carnot or Helmholtz. [9] 


1 And the term is regularly misused, not only erroneously, but often cynically. 
2 Deep learning refers to the use of a network constructed of many layers of artificial neurons. 
Figure 1: Here we see the number of arXiv:astro-ph submissions per month that have abstracts 
or titles containing one or more of the strings: ‘machine learning’, ‘ML’, ‘artificial intelligence’, 
‘AI’, ‘deep learning’, or ‘neural network’. The raw data is in the public domain and is available at 
https://www.kaggle.com/Cornell-University/arxiv. 


Philosophical concerns aside, LeCun’s fundamental point is that deep learning ‘works’ and 
therefore we should use it, even if we do not fully understand it. If one were being uncharitable, 
one could make similar arguments about the ΛCDM paradigm. 

It is clear that in every field that deep learning has infiltrated we have seen a reduction in the 
use of specialist knowledge, to be replaced with knowledge automatically derived from data. We 
have already seen this process play out in many ‘applied deep learning’ fields such as computer 
Go [10], protein folding [11], natural language processing [12], and computer vision [13]. We 
argue that astronomy’s data abundance corrals it onto a path no different to that trodden by 
other applied deep learning fields. This abundance is not a passing phase; the total astronomical 
data volume is already large and will increase exponentially in the coming years. We illustrate 
this in Fig. 2, where we present a selection of astronomical surveys and their estimated data 
volume output over their lifetimes [14]. And this is not even considering data associated with ever 
larger and more detailed numerical simulations [15]. The current scale of the data volume already 
poses an issue for astronomy as many classical methods rely on human supervision and specialist 
expertise, and the increasing data volume will make exploring and exploiting these surveys 
through traditional human supervised and semi-supervised means an intractable problem. Of 
serious concern is the possibility that we will miss — or substantially delay — interesting and 
important discoveries simply due to our inability to accurately and consistently interrogate 
astronomical data at scale. Deep learning has shown great promise in automating information 
extraction in various data intensive fields, and so is ideally poised as a solution to the challenge 
of processing ultra-large scale astronomical data. But we do not need to stop there. This review’s 
outlook ventures a step further, and argues that astronomy’s wealth of data should be considered 
a unique opportunity, and not merely an albatross. 


Since astronomical connectionism’s³ humble beginnings in the late 1980s, there have been 
numerous excellent reviews on the application of artificial neural networks to astronomy [e.g. 
16-18]. We take an alternative approach to previous literature reviews and survey the field 
holistically, in an attempt to paint astronomical connectionism’s ‘Big Picture’ with broad strokes. 
While we cannot possibly include all works within astronomical connectionism⁴, we hope that 
this review serves as a historical background on astronomy’s ‘three waves’ of increasingly 
automated connectionism, as well as presenting a general primer on neural networks that may 
assist those seeking to explore this fascinating topic for the first time. 
Figure 2: The data volume output of a selection of astronomical surveys over their lifetimes. We 
can see the astronomical survey data volume doubles every 16 months. Data is taken from Zhang 
and Zhao [14]. 


In §2 and §3 we explore initial work on multilayer perceptrons within astronomy, where 
models required manually selected emergent properties as input. In §4 and §5 we explore the 
second wave, which coincided with the dissemination of convolutional neural networks and 
recurrent neural networks — models where the multilayer perceptron’s manually selected inputs 
are replaced with raw data ingestion. In the third wave that is happening now we are seeing 
the removal of human supervision altogether with deep learning methods inferring labels and 
knowledge directly from the data, and we explore this wave in §6-§8. Finally, in §9, we look to the 
future and predict that we will soon enter a fourth wave of astronomical connectionism. We argue 
that if astronomy follows the pattern of other applied deep learning fields we will see the removal 
of expertly crafted deep learning models, to be replaced with fine-tuned versions of an 
all-encompassing ‘foundation’ model. As part of this fourth wave we argue for a symbiosis between 
astronomy and connectionism, a symbiosis predicated on astronomy’s relative data wealth and 
deep learning’s insatiable data appetite. Many ultra-large datasets in machine learning are 
proprietary or of poor quality, and so there is an opportunity for astronomers as a community 
to develop and provide a high quality multimodal public dataset. In turn, this dataset could be 
used to train an astronomical foundation model to serve state-of-the-art downstream tasks. Due 
to foundation models’ hunger for data and compute, a single astronomical research group could 
not bring about such a model alone. Therefore, we conclude that astronomy as a discipline has 
a slim chance of keeping up with a research pace set by the Big Tech goliaths, unless we 
follow the examples of EleutherAI and HuggingFace and pool our resources in a grassroots open 
source fashion. 

3 As a shorthand, in this review we will refer to ‘parallel distributed processing’ (the dominant connectionist paradigm) as 
simply ‘connectionism’. 
4 We refer the reader to Fig. 1! 

Before moving on, we must first admit to our readers that we have not been entirely honest 
with them. The abstract of this review has not been written by us. It has actually been generated 
by prompting OpenAI’s ‘GPT-3’ neural network based foundation model [12] with the abstract 
from MJS’s PhD thesis [19]! We explore these models in more detail in §9. 


2. A primer on artificial neurons 


In 1943, McCulloch and Pitts proposed the first computational model of a biological neuron [MP 
neuron; 20]. Their model consisted of a set of binary inputs $x_i \in \{0, 1\}$ and a single binary output 
$y \in \{0, 1\}$. Their model also defines a single ‘inhibitory’ input $\mathcal{I} \in \{0, 1\}$ that blocks output if $\mathcal{I} = 1$. 
If the sum of the inputs exceeds a threshold value $\Theta$, the MP neuron ‘fires’ and outputs $y = 1$. 
Mathematically we can write the MP neuron function as 

\[
\mathrm{MP}(\mathbf{x}) =
\begin{cases}
1 & \text{if } \sum_i x_i > \Theta \text{ and } \mathcal{I} = 0, \\
0 & \text{otherwise.}
\end{cases}
\]


The MP neuron is quite a powerful abstraction. Single MP neurons can calculate simple boolean 
functions, and more complicated functions can be calculated when many MP neurons are chained 
together. However, there is one show-stopping issue: the MP neuron is missing the capacity to 
learn. Rosenblatt [21] addressed this by combining the MP neuron with Hebb’s neuronal wiring 
theory⁵ [22], and we will explore Rosenblatt’s formulation in the next subsection. 
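The MP neuron defined above is simple enough to write out directly. The following is a minimal Python sketch, assuming the strict inequality implied by ‘exceeds’; the function names and the particular threshold values are illustrative choices and are not taken from McCulloch and Pitts.

```python
def mp_neuron(inputs, threshold, inhibitory=0):
    """McCulloch-Pitts neuron: output 1 when the sum of the binary inputs
    exceeds the threshold and the inhibitory input is switched off."""
    if inhibitory == 1:
        return 0
    return 1 if sum(inputs) > threshold else 0


# Simple boolean functions obtained only by choosing the threshold.
def mp_and(a, b):
    return mp_neuron([a, b], threshold=1)  # fires only when a + b = 2


def mp_or(a, b):
    return mp_neuron([a, b], threshold=0)  # fires when a + b >= 1


def mp_not(a):
    # A constant input of 1, with `a` wired to the inhibitory line.
    return mp_neuron([1], threshold=0, inhibitory=a)
```

Note that nothing in this sketch is learnt: every threshold is set by hand, which is the limitation the perceptron addresses.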


2(a) The perceptron 


Like the MP neuron, Rosenblatt’s perceptron takes a number of numeric inputs ($x_i$). However, 
unlike the MP neuron each one of these inputs is multiplied by a corresponding weight ($w_i$) 
signifying the importance the perceptron assigns to a given input. As shown in Fig. 3, Rosenblatt 
then sums this list of products and passes it into an ‘activation function’. Rosenblatt used the 
Heaviside step function in his original formulation: 


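\[
\text{prediction} = H(\mathbf{w} \cdot \mathbf{x}) =
\begin{cases}
0 & \text{if } \mathbf{w} \cdot \mathbf{x} \leq 0, \\
1 & \text{if } \mathbf{w} \cdot \mathbf{x} > 0.
\end{cases}
\tag{2.1}
\]

[Inline plot: the step function $H(\mathbf{w} \cdot \mathbf{x})$ plotted against $\mathbf{w} \cdot \mathbf{x}$.]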


5 Also known by the mantra ‘cells that fire together wire together’. 

To concretise exactly how Rosenblatt’s perceptron learns we will use an example. Let us say 
that we want to automatically label a set of galaxy images as either ‘spiral’ or ‘elliptical’. To 
do this we first need to compile a training dataset of galaxy images. This training set would 
consist of spiral and elliptical galaxies, and each image would have a ground truth label $y$: 
say ‘0’ for a spiral galaxy and ‘1’ for an elliptical. To train our perceptron we randomly choose 
one image from the training set, and feed it to the perceptron, with the numerical value of each 
pixel corresponding to an input $\{x_1, \ldots, x_N\}$. These inputs are multiplied by their corresponding 
weight $\{w_1, \ldots, w_N\}$. A bias term ($b = w_0 x_0$, where $x_0 = 1$) is also added to the inputs, which 
allows the neuron to shift its activation function linearly. Since we do not want our perceptron to 
have any prior knowledge of the task, we initialise the weights at random. The resulting products 
are then summed. Finally, our activation function $H$ transforms $\mathbf{w} \cdot \mathbf{x}$ and produces a prediction 
$p$. We then compare $p$ to $y$ via a ‘loss function,’ which is a function that measures the difference 
between $p$ and $y$. The loss can be any differentiable function, so for illustration purposes we will 
define it here as the L1 loss: $\mathcal{L}(y, p) = |y - p|$. Now that we can compare to the ground truth, we 
need to work out how a change in one of our weights affects the loss (that is, we want to find 
$\partial \mathcal{L} / \partial \mathbf{w}$). We can calculate this change with the chain rule 


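\[
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial p} \frac{\partial p}{\partial \mathbf{w}},
\tag{2.2}
\]

and since $p = H(\mathbf{w} \cdot \mathbf{x})$ and $\partial p / \partial \mathbf{w} = H' \mathbf{x}^T$ we get 

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{w}} = \frac{\partial \mathcal{L}}{\partial p} \odot (H' \mathbf{x}^T),
\]

where $\odot$ is the distributive Hadamard product. Thus we can update the weights to decrease the 
loss function: 

\[
\begin{aligned}
\mathbf{w}_{\text{next}} &= \mathbf{w} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{w}} \\
&= \mathbf{w} - \eta \frac{\partial \mathcal{L}}{\partial p} \odot (H' \mathbf{x}^T),
\end{aligned}
\]

where $\eta$ is the learning rate⁶. If we repeat this process our perceptron will get better and better at 
classifying our galaxies! 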
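To make the training loop concrete, the sketch below trains a perceptron with a Heaviside activation on a small toy dataset. It rests on several assumptions that are not in the text: two made-up features per object stand in for pixel values, and, because the derivative of the step function vanishes almost everywhere, the update uses the classic perceptron learning rule $\mathbf{w} \leftarrow \mathbf{w} + \eta (y - p) \mathbf{x}$ in place of the literal gradient expression. The names `train_perceptron` and `toy_data` are illustrative.

```python
import random


def heaviside(z):
    return 1 if z > 0 else 0


def predict(w, x):
    # x[0] is the constant bias input x_0 = 1, so w[0] plays the role of the bias.
    return heaviside(sum(wi * xi for wi, xi in zip(w, x)))


def train_perceptron(data, eta=0.1, epochs=200, seed=0):
    """Classic perceptron rule: weights only move when the prediction is wrong."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(len(data[0][0]))]  # random initialisation
    order = list(data)
    for _ in range(epochs):
        rng.shuffle(order)  # present the examples in random order
        for x, y in order:
            p = predict(w, x)
            w = [wi + eta * (y - p) * xi for wi, xi in zip(w, x)]
    return w


# Toy, linearly separable data: (bias input, feature 1, feature 2) -> label.
# Label 0 stands in for 'spiral' and label 1 for 'elliptical'.
toy_data = [
    ([1, 0.0, 0.0], 0), ([1, 0.2, 0.1], 0), ([1, 0.1, 0.3], 0), ([1, 0.3, 0.2], 0),
    ([1, 1.0, 1.0], 1), ([1, 0.8, 0.9], 1), ([1, 0.9, 0.7], 1), ([1, 0.7, 0.8], 1),
]
w = train_perceptron(toy_data)
```

Because the toy classes are linearly separable, the perceptron convergence theorem guarantees that this rule stops making mistakes after a finite number of updates; on data that are not linearly separable it would not settle.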


Figure 3: A single neuron (or perceptron) with a bias $w_0$, inputs $x_1, x_2, \ldots, x_N$, and weights 
$w_1, w_2, \ldots, w_N$. 


6 The eagle-eyed reader may have noticed that since the derivative of the Heaviside step function is the Dirac delta function, 
we will only update the perceptron’s weights on an incorrect prediction. If we want to also learn from positive examples, we 
need to use a smoothly differentiable activation function. This is explored in the next subsection. 
We must go further than training a single layer perceptron; in Perceptrons: An Introduction to 
Computational Geometry, Minsky and Papert [e.g. §13.0; 23] show that the single layer perceptron is 
only able to calculate linearly separable functions, among other limitations. Their book (alongside 
a consensus that AI had failed to deliver on its early grandiose promises) delivered a big blow 
to the connectionist school of artificial intelligence⁷. In the years following Minsky and Papert 
[23] governmental and industry funding was pulled from connectionist research laboratories, 
ushering in the first ‘AI winter’⁸. 

Yet, as exemplified in Rosenblatt [§5.2, theorem 1; 27] it was known at the time that multilayer 
perceptrons could calculate non-linearly separable functions (such as the ‘exclusive or’). We can 
prove intuitively that a set of neurons can calculate any function: a perceptron can perfectly 
emulate a NAND gate (Fig. 4), and the singleton set {NAND} is functionally complete. Since we 
can combine a set of NAND gates to calculate any function, we must also be able to combine a set of 
neurons to calculate any function⁹. Such a group of neurons is known as the multilayer perceptron 
(MLP). Unfortunately, we cannot simply stack perceptrons together as we are missing one vital 
ingredient: a way to train the network! At the time of Minsky and Papert’s treatise on perceptrons 
there was no widely known algorithm [in the West; see 25] that could train such a multilayer 
network. In Minsky and Papert’s own words: 


Nevertheless, we consider it to be an important research problem to elucidate (or reject) our intuitive 
judgment that the extension [from one layer to many] is sterile. Perhaps some powerful convergence 
theorem will be discovered, or some profound reason for the failure to produce an interesting ‘learning 
theorem’ for the multilayered machine will be found. (§13.2; Minsky and Papert [23], on MLPs) 


The field had to wait almost two decades for such an algorithm to become widespread. In the 
next subsection we will explore backpropagation, the algorithm that ultimately proved Minsky 
and Papert’s intuition wrong. 
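The functional-completeness argument made earlier in this subsection can be checked directly. The sketch below builds a NAND gate from a single perceptron using the weights listed in Fig. 4 (bias 1.5, input weights of −1), and then wires four such gates together to compute the ‘exclusive or’, which no single perceptron can represent. The four-gate wiring is the standard textbook construction and is an illustrative choice, not something specified in this review.

```python
def heaviside(z):
    return 1 if z > 0 else 0


def nand(x1, x2):
    # Perceptron weights from Fig. 4: bias 1.5, both input weights -1.
    return heaviside(1.5 + (-1) * x1 + (-1) * x2)


def xor(x1, x2):
    # Two layers of NAND perceptrons feeding a final NAND perceptron.
    a = nand(x1, x2)
    return nand(nand(x1, a), nand(x2, a))


truth_table = [(x1, x2, xor(x1, x2)) for x1 in (0, 1) for x2 in (0, 1)]
```

The weights here are set by hand; what was missing at the time was an algorithm to learn them.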


2(b) The multilayer perceptron 


Grouping many artificial neurons together may result in something resembling Fig. 5. This 
network consists of an input layer, two intermediate ‘hidden’ layers, and an output layer. As 
in the previous section, let us say that we want a classifier that can classify a set of galaxy images 
into elliptical and spiral types. In an MLP similar to Fig. 5 a neuron would be assigned to each 
pixel in a galaxy image. Each neuron would take the numeric value of that pixel, and propagate 
that signal forward into the network. The next layer of neurons does the same, with the input 
being the previous layer’s output. This process continues until we reach the output layer. In a 
binary classification task like our galaxy classifier this layer outputs a value between zero and 
one. Thus, if we define a spiral galaxy as zero, and an elliptical galaxy as one, we would want the 
network output to be near zero for a spiral galaxy input (and vice versa). 
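The forward pass described above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: the layer sizes, the 4 × 4 ‘image’, the random weights, and the helper names are all hypothetical, and the sigmoid activation (discussed later in this section) is assumed so that the output lies between zero and one.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(weights, biases, inputs):
    """One fully connected layer: each row of `weights` feeds one neuron."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def forward(layers, pixels):
    """Propagate a flattened image through every layer in turn."""
    signal = pixels
    for weights, biases in layers:
        signal = dense(weights, biases, signal)
    return signal


def random_layer(n_in, n_out, rng):
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# One input per pixel, two hidden layers, and a single output neuron.
layers = [random_layer(16, 8, rng), random_layer(8, 4, rng), random_layer(4, 1, rng)]
image = [rng.random() for _ in range(16)]  # a flattened 4 x 4 'galaxy image'
p = forward(layers, image)[0]              # untrained, so the value is arbitrary
```

With untrained weights the output carries no information about the galaxy type; training is what pushes it towards zero for spirals and one for ellipticals.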

In §2(a) we found the change we needed to apply to a single neuron’s weights to make it learn 
from a training example. We can train an MLP in a similar way by employing the reverse mode 
of automatic differentiation (or backpropagation) to learn from our galaxy training data set 
[36-38]¹⁰. We want our network to learn when it makes both a correct and incorrect prediction, so we 
define our activation function as a smoothed version of Rosenblatt’s perceptron activation. This 
ensures that a signal is present in the derivative no matter which values are input. This activation 
function is known as the ‘sigmoid’ function, and is shown in Fig. 6. 

7 See Metz [24] for a closer look at the conflicts and personalities that shaped AI. 

8 At least, in the Western world. Connectionism continued in earnest in the Soviet Union [25,26]. 

9 More formally, Cybenko [28] and Hornik et al. [29] prove that an infinitely wide neural network can calculate any function, 
and Lu et al. [30] prove that an infinitely deep neural network is a universal approximator. 

10 Some controversy surrounds backpropagation’s discovery. The Finnish computer scientist Linnainmaa proposed the reverse 
mode of automatic differentiation and adapted the algorithm to run on computers in their 1970 (Finnish language) thesis [39]. 
They first published their findings in English in 1976. Werbos then proposed applying an adaptation of Linnainmaa’s method 
to artificial neural networks. Rumelhart et al. [38] showed experimentally that backpropagation can generate meaningful 
internal representations within a neural network, and popularised the method. Here we will err on the side of caution and 
cite all three manuscripts. For further reading we recommend Schmidhuber [40] and Baydin et al. [41]. 

$x_1$   $x_2$   $\neg(x_1 \wedge x_2)$   $p = H(\mathbf{w} \cdot \mathbf{x})$ 
0       0       1                        H(1.5 + (-1)·0 + (-1)·0) = 1 
0       1       1                        H(1.5 + (-1)·0 + (-1)·1) = 1 
1       0       1                        H(1.5 + (-1)·1 + (-1)·0) = 1 
1       1       0                        H(1.5 + (-1)·1 + (-1)·1) = 0 

Figure 4: If we define $H(\mathbf{w} \cdot \mathbf{x})$ as in Eq. 2.1 we can set a perceptron’s weights so that it is 
equivalent to the NAND gate. 

Figure 5: The multilayer perceptron, or artificial neural network. The depicted network has two 
hidden layers. It takes $N$ inputs $x_1, x_2, \ldots, x_N$, and outputs a prediction $p_L$. Note that here we 
omit the explicit bias terms (i.e. $w_0$). 

Figure 6: A curated selection of activation functions. In all plots, the x axis is the input, and the y 
axis is the output. The rectified linear unit (ReLU) activation function was first introduced in the 
context of neural networks in Fukushima [31] and later rediscovered, named, and popularised 
in Nair and Hinton [32]. The exponential linear unit (ELU), Swish and Mish activations were 
respectively introduced in Clevert et al. [33], Ramachandran et al. [34], and Misra [35]. 

As in §2(a) we define a loss function $\mathcal{L}(y, p)$ that describes the similarity between a ground 
truth ($y$) and a prediction ($p$). We also define a neuron’s activation function as $\varphi(\mathbf{w} \cdot \mathbf{x})$ where 
$\mathbf{w} \cdot \mathbf{x}$ is the weighted sum of a neuron’s inputs. Following from Eq. 2.2: 

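\[
\frac{\partial \mathcal{L}}{\partial \mathbf{w}_l} = \frac{\partial \mathcal{L}}{\partial p_l} \frac{\partial p_l}{\partial \mathbf{w}_l},
\]

where $l$ is a layer in the MLP. In the same way as in §2(a) we can calculate an MLP’s final layer’s 
($l = L$) weight updates in terms of known values: 

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{w}_L} = \frac{\partial \mathcal{L}}{\partial p_L} \odot (\varphi'_L p^T_{L-1}),
\tag{2.3}
\]

where $p_{L-1}$ are the outputs from the previous layer. To calculate the $(L-1)$th layer’s weight 
updates we use the chain rule: 

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{w}_{L-1}} = \frac{\partial \mathcal{L}}{\partial p_L} \frac{\partial p_L}{\partial p_{L-1}} \frac{\partial p_{L-1}}{\partial \mathbf{w}_{L-1}}.
\]

Likewise for the $(L-n)$th layer: 

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{w}_{L-n}} = \frac{\partial \mathcal{L}}{\partial p_L} \left( \prod_{i=1}^{n} \frac{\partial p_{L-i+1}}{\partial p_{L-i}} \right) \frac{\partial p_{L-n}}{\partial \mathbf{w}_{L-n}}.
\]

Now we can start plugging in some known values. Since $p_l = \varphi_l(\mathbf{w}_l \cdot p_{l-1})$, it follows that 
$\partial p_l / \partial p_{l-1} = \varphi'_l \mathbf{w}^T_l$, and $\partial p_l / \partial \mathbf{w}_l = \varphi'_l p^T_{l-1}$. So: 

\[
\frac{\partial \mathcal{L}}{\partial \mathbf{w}_{L-n}} = \frac{\partial \mathcal{L}}{\partial p_L} \odot \left( \prod_{i=1}^{n} \varphi'_{L-i+1} \mathbf{w}^T_{L-i+1} \right) (\varphi'_{L-n} p^T_{L-n-1}).
\tag{2.4}
\]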
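Combining Eq. 2.3 with Eq. 2.4 we get the weight update algorithm for the $(L-n)$th layer of the 
MLP: 

\[
\mathbf{w}_{\text{next}} = \mathbf{w} - \eta
\begin{cases}
\dfrac{\partial \mathcal{L}}{\partial p_L} \odot (\varphi'_L p^T_{L-1}), & \text{for } n = 0, \\[1ex]
\dfrac{\partial \mathcal{L}}{\partial p_L} \odot \left( \prod_{i=1}^{n} \varphi'_{L-i+1} \mathbf{w}^T_{L-i+1} \right) (\varphi'_{L-n} p^T_{L-n-1}), & \text{for } n > 0.
\end{cases}
\tag{2.5}
\]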


With this equation¹¹ in hand we can use the same technique described earlier in this section and 
in §2(a) to update the network’s weights with each galaxy image to decrease the loss function 
$\mathcal{L}$. Again, as $\mathcal{L}$ is minimised, our MLP will classify our elliptical and spiral galaxy images with 
increasing accuracy. 
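The layer-by-layer chain rule used in this section can be sketched in plain Python. The example below is a minimal illustration under stated assumptions: a sigmoid activation in every layer, no bias terms (as in Fig. 5), a single output neuron, and a squared-error loss chosen because its derivative is simple. The layer sizes, the input values, and all function names are hypothetical. A finite-difference helper is included so that the backpropagated gradients can be checked numerically.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(layers, x):
    """Return the outputs of every layer, starting with the input itself."""
    outputs = [x]
    for W in layers:
        prev = outputs[-1]
        outputs.append([sigmoid(sum(w * p for w, p in zip(row, prev))) for row in W])
    return outputs


def loss(p, y):
    return 0.5 * (p - y) ** 2


def backprop(layers, x, y):
    """Gradient of the loss with respect to every weight, via the chain rule."""
    outputs = forward(layers, x)
    grad_p = [outputs[-1][0] - y]  # dL/dp at the final layer
    grads = [None] * len(layers)
    for l in reversed(range(len(layers))):
        p_out, p_in = outputs[l + 1], outputs[l]
        # delta_j = dL/dp_j * phi'(z_j); for the sigmoid, phi' = p (1 - p).
        delta = [g * p * (1.0 - p) for g, p in zip(grad_p, p_out)]
        # dL/dW_jk = delta_j * (output k of the previous layer).
        grads[l] = [[d * p for p in p_in] for d in delta]
        # Pass the signal one layer back: dL/dp_in_k = sum_j delta_j * W_jk.
        grad_p = [sum(d * row[k] for d, row in zip(delta, layers[l]))
                  for k in range(len(p_in))]
    return grads


def sgd_step(layers, grads, eta):
    """w_next = w - eta * dL/dw, applied to every weight."""
    return [[[w - eta * g for w, g in zip(w_row, g_row)]
             for w_row, g_row in zip(W, G)]
            for W, G in zip(layers, grads)]


def numerical_grad(layers, x, y, l, j, k, eps=1e-6):
    """Central finite-difference estimate of dL/dW[l][j][k], for checking."""
    original = layers[l][j][k]
    layers[l][j][k] = original + eps
    up = loss(forward(layers, x)[-1][0], y)
    layers[l][j][k] = original - eps
    down = loss(forward(layers, x)[-1][0], y)
    layers[l][j][k] = original
    return (up - down) / (2 * eps)


rng = random.Random(0)
sizes = [4, 3, 3, 1]  # 4 'pixels', two hidden layers, 1 output
layers = [[[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_out)]
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x, y = [0.2, 0.9, 0.4, 0.7], 1.0
grads = backprop(layers, x, y)
updated = sgd_step(layers, grads, eta=0.1)
```

Each pass through the loop in `backprop` reuses the signal computed for the layer above it, which is what makes the reverse mode efficient: no derivative is evaluated twice.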


3. Astronomy’s first wave of connectionism 


Connectionism was first discussed within astronomy in the late 1980s, after the popularisation 
of backpropagation (see footnote 10) and the consequent passing of the first ‘AI winter’. Two 
radical studies emerged in 1988 that recognised areas where astronomy could benefit from the use 
of ANNs [42,43]. Together, they identified that astronomical object classification¹² and telescope 
scheduling could be solved through the use of an ANN. These studies were followed by a rapid 
broadening of the field, and the application of connectionism to many disparate astronomical use 
cases [16, and references therein]. In this section, we will outline areas where MLPs found an early 
use in astronomy. 


3(a) Classification problems 


Odewahn et al. [44] classified astronomical objects into star and galaxy types. These were taken 
from the Palomar Sky Survey Automated Plate Scanner catalogue [45]. To compile their dataset, 
they first extracted a set of emergent image parameters from the scanned observations. These 
parameters included the diameter, ellipticity, area, and plate transmission. The parameters were 
then used to train both a linear perceptron and a feedforward MLP to classify the objects into stars 
or galaxies. Odewahn et al. [44] found that their best performing model could classify galaxies 
with a completeness of 95% for objects down to a magnitude < 19.5. This work was followed by 
many more studies on the star/galaxy classification problem [e.g. 46-49]. Galaxy morphological 
type classification was explored in the early 1990s. Storrie-Lombardi et al. [50] describe an MLP 
that takes as input a selected set of thirteen galaxy summary statistics, and uses this information 
to classify a galaxy into one of five morphological types. Storrie-Lombardi et al. [50] report a 
top one accuracy of 64%, and a top two accuracy of 90%. This pilot study was followed by 
several studies from the same group that confirmed that MLPs are effective automatic galaxy 
morphological classifiers [51-56, see §5 for a continuation of this line of research]. 
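For readers unfamiliar with the ‘top one’ and ‘top two’ accuracies quoted above, the sketch below shows how such a metric is typically computed from a classifier’s per-class scores. The scores and labels are invented purely for illustration and have no connection to the data of Storrie-Lombardi et al. [50]; the function name is likewise hypothetical.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of objects whose true class is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Invented network outputs for four galaxies over five morphological types.
scores = [
    [0.60, 0.20, 0.10, 0.05, 0.05],
    [0.30, 0.40, 0.10, 0.10, 0.10],
    [0.10, 0.20, 0.50, 0.10, 0.10],
    [0.25, 0.30, 0.20, 0.15, 0.10],
]
labels = [0, 0, 2, 2]  # true class index for each galaxy

top_one = top_k_accuracy(scores, labels, k=1)
top_two = top_k_accuracy(scores, labels, k=2)
```

A top two accuracy well above the top one accuracy indicates that the network’s mistakes are usually near misses, with the true class ranked second.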

MLPs were also used in other classification tasks; here we highlight a few further areas 
where MLPs were applied. von Hippel et al. [57] classified stellar spectra into temperature types, 
and Klusch and Napiwotzki [58] did the same for Morgan-Keenan System types. Chon [59] 
described the use of an MLP to search for and classify muon events (and therefore neutrino 
observations) in the Sudbury Neutrino Observatory. Quasar classification has been explored in 
several studies [60-62]. Seminally, Carballo et al. [60] used an MLP to select quasar candidates 
given their radio flux, integrated-to-peak flux ratio, photometry and point spread function in 
the red and blue bands, and their radio-optical position separation. They found good agreement 
between their model and that of the decision tree described in White et al. [63], confirming 
MLPs as a competitive alternative to more traditional machine learning. As part of the Supernova 
photometric Classification Challenge [SPCC; 64], Karpenka et al. [65] proposed the use of a neural 
network to classify supernovae into Type-Ia/non-Type-Ia classes. To classify their light curves, 
they first used a hand-crafted fitting function, and then trained their MLP on the fitted coefficients. 
They found that their model was competitive with other, more complex models trained on the 
SPCC dataset. From the studies discussed in this section we can safely conclude that MLPs are 
effective classifiers of astronomical data, when given important parameters extracted by an expert 
guide. 

11 If we examine Eq. 2.5 carefully, we can see why we add nonlinearities between the MLP layers; without activation functions 
Eq. 2.5 collapses to the equivalent of a single layer MLP! 

12 Specifically, galaxies were discussed in Rappaport and Anderson [42] and point sources observed with the Infra-Red 
Astronomical Satellite (IRAS) were discussed in Adorf and Johnston [43]. 


3(b) Regression problems 


MLPs have also been used in regression problems. Angel et al. [66] applied them first to 
adaptive telescope optics. They trained their MLP on 250 000 simulated in-focus and out-of-focus 
observations of stars as seen by the Multiple Mirror Telescope (MMT). From the flattened 13 × 13 
pixel observations, their network predicted the piston position and tilt required for each of the 
MMT’s mirrors to bring the stars into focus. After the application of these corrections, the authors 
were able to recover the original profile. In follow up studies, Sandler et al. [67] and Lloyd-Hart 
et al. [68] proved that Angel et al.’s MLP worked on the real MMT. 

Photometric redshift estimation was explored in many concurrent studies [e.g. 56,69—72]. Firth 
et al. [69] train a neural network to predict the redshift of galaxies contained in the Sloan Digital 
Sky Survey (SDSS) early data release [73]. The galaxies were input to the neural network as a set 
of summary parameters, and the output was a single float representing the galaxy redshift. They 
found their network attained a performance comparable to classical techniques. Extending and 
confirming the work by Firth et al. [69], Ball et al. [56] used an MLP to predict the redshift of 
galaxies contained in the SDSS’s first data release [74]. They also showed that MLPs were capable 
of predicting the galaxies’ spectral types and morphological classifications. 

Of course, MLPs have been used more widely in astronomical regression tasks. Here we will 
cherry pick a few studies to show the MLP’s early breadth of use. Sunspot maxima prediction 
was carried out by Koons and Gorney [75]. They found their MLP based method was capable 
of predicting the number of sunspots when trained on previous cycles. Bailer-Jones et al. [76] 
predicted the effective temperature of a star from its spectrum. Auld et al. [77] and Auld et al. 
[78] applied MLPs to cosmology, demonstrating that MLPs are capable of predicting the cosmic 
microwave background, given a set of cosmological parameters. Nørgaard-Nielsen and Jørgensen 
[79] used an MLP to remove the foreground from microwave temperature maps. From the studies 
discussed in this section we can see that MLPs are effective regressors of astronomical data, when 
given significant parameters extracted by an expert guide. 


4. Contemporary supervised deep learning 


There are some issues with MLPs. Primarily they do not scale well to high dimensional datasets. 
For example, if our dataset consists of images of 128 × 128 pixels, we will need 16 384 neurons 
in the MLP’s input layer alone! As we move into the hidden layers, this scaling issue only gets 
worse. Also, since MLPs must take an unrolled image as an input, they disregard any spatial 
properties of their training images, and so either need a substantial amount of training data to 
classify or generate large images¹³, or an expert to extract descriptive features from the data in 
a preprocessing step. We can see this issue writ large in the previous section—most of the MLP 
applications described in §3 require an expert to extract features from the data for the network to 
then train on! This drawback is not ideal; what if there are features within the raw data that are 
not present in these cherry picked statistics? In that case, it would be preferable to let the neural 
network take in the raw data as input, and then learn which features are the most descriptive. We 
will discuss neural network architectures that solve both the MLP scaling problem and the expert 


13At the height of the convolutional neural network architecture’s popularity in the mid 2010s these were real problems. 
However, with the growth of computing power and data in recent years we are seeing a resurgence of the more general MLP 
model [e.g. 80-83]. This follows the prevailing trend in AI where the removal of human-crafted features and biases ultimately 
results in more expressive models that learn such features and biases dieectly from data [84,85]. 
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reliance problem in this section. After we have explored these architectures in general, we will 
discuss their application to astronomical problems in §5. 
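
To make the scaling argument concrete, the short sketch below counts the trainable parameters in the first 
layer of an MLP fed an unrolled 128 x 128 image, and compares this with a small convolutional layer of the 
kind introduced in §4(a). The hidden layer width, the 3 x 3 kernel size, and the 32 filters are illustrative 
assumptions of ours, not values taken from any study discussed in this review. 

```python
def dense_parameters(n_inputs: int, n_outputs: int) -> int:
    """Weights plus biases of a single fully connected layer."""
    return n_inputs * n_outputs + n_outputs


def conv_parameters(kernel_size: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of a 2D convolutional layer with square kernels."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels


n_pixels = 128 * 128  # neurons needed in the MLP input layer

# Assumption: the first hidden layer is as wide as the input layer.
mlp_first_layer = dense_parameters(n_pixels, n_pixels)

# Assumption: 32 filters of 3 x 3 pixels applied to a single-channel image.
cnn_first_layer = conv_parameters(3, 1, 32)

print(f"input neurons:          {n_pixels}")
print(f"dense layer parameters: {mlp_first_layer}")
print(f"conv layer parameters:  {cnn_first_layer}")
```

Under these assumptions the dense layer needs several hundred million parameters, whereas the convolutional 
layer needs a few hundred, because its filter weights are shared across every position in the image. 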


4(a) Convolutional neural networks 


Unlike the MLP described in the previous section, convolutional neural networks (CNNs; 
introduced in Fukushima [31] and first combined with backpropagation in LeCun et al. [86]) do 
not entirely consist of fully connected layers, where each neuron is connected to every neuron in 
the previous and subsequent layers. Instead, the CNN (such as the one depicted in Fig. 7) uses 
convolutional layers in place of the majority (or all) of the dense layers. 


Figure 7: A convolutional neural network classifying a spiral galaxy image. 


We can think of a convolutional layer as a set of learnt ‘feature filters’. These feature filters 
perform a local transform on input imagery. In classical computer vision, these filters are hand 
crafted, and perform a predetermined function, such as edge detection or blurring. In contrast, 
a CNN learns the optimal set of filters for its task (say, galaxy classification). Eq. 4.1 shows two 
different convolution operators (footnote 15) being performed on an array. 


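$$
\begin{gathered}
\begin{bmatrix}
39 & 57 & 86 & 9 & 26 \\
90 & 74 & 63 & 87 & 98 \\
79 & 34 & 26 & 16 & 46 \\
67 & 61 & 96 & 1 & 79 \\
33 & 47 & 15 & 49 & 29
\end{bmatrix}
\ast
\begin{bmatrix}
0 & 0 & 0 \\
0 & 0 & 0 \\
0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
26 & 16 & 46 \\
96 & 1 & 79 \\
15 & 49 & 29
\end{bmatrix}
\\[1ex]
\begin{bmatrix}
39 & 57 & 86 & 9 & 26 \\
90 & 74 & 63 & 87 & 98 \\
79 & 34 & 26 & 16 & 46 \\
67 & 61 & 96 & 1 & 79 \\
33 & 47 & 15 & 49 & 29
\end{bmatrix}
\ast
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
=
\begin{bmatrix}
139 & 136 & 219 \\
220 & 101 & 158 \\
155 & 179 & 56
\end{bmatrix}
\end{gathered}
\tag{4.1}
$$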


In the above equation the operation is represented as a matrix. In a CNN the matrix is a set 
of neuronal weights. As seen in Fig. 7 there are multiple feature maps in a convolutional layer, 
each containing a set of weights independent to the other feature maps, and learning to extract a 
different feature. As in the MLP described in the previous section, the weights are updated using 
backpropagation to minimise a loss function. We will discuss astronomical applications of CNNs 
in §5, after we introduce modern CNN architectures. 
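
The operation in Eq. 4.1 can be reproduced in a few lines of NumPy. The sketch below is a minimal, unoptimised 
implementation of an unpadded (‘valid’) cross-correlation; the function name `cross_correlate2d` and the kernel 
names are our own, and the kernels are hand-set here rather than learnt as they would be in a CNN. 

```python
import numpy as np


def cross_correlate2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """'Valid' 2D cross-correlation: slide the kernel over the image without padding."""
    k_rows, k_cols = kernel.shape
    out_rows = image.shape[0] - k_rows + 1
    out_cols = image.shape[1] - k_cols + 1
    out = np.zeros((out_rows, out_cols), dtype=image.dtype)
    for i in range(out_rows):
        for j in range(out_cols):
            # Multiply the kernel with the patch beneath it and sum the result.
            window = image[i:i + k_rows, j:j + k_cols]
            out[i, j] = np.sum(window * kernel)
    return out


image = np.array([
    [39, 57, 86, 9, 26],
    [90, 74, 63, 87, 98],
    [79, 34, 26, 16, 46],
    [67, 61, 96, 1, 79],
    [33, 47, 15, 49, 29],
])

# First kernel of Eq. 4.1: picks out the bottom-right pixel of each patch.
shift_kernel = np.array([[0, 0, 0], [0, 0, 0], [0, 0, 1]])
# Second kernel of Eq. 4.1: sums the main diagonal of each patch.
diagonal_kernel = np.eye(3, dtype=int)

shifted = cross_correlate2d(image, shift_kernel)
diagonal_sum = cross_correlate2d(image, diagonal_kernel)
print(shifted)
print(diagonal_sum)
```

Both outputs are 3 x 3 arrays because a 3 x 3 kernel fits into a 5 x 5 image at only three positions along 
each axis. 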

14 All astronomical objects shown in the neural network diagrams within this manuscript are generated via text prompts fed 
into a latent diffusion neural network model [87]. 

15 We must note that in Eq. 4.1 we follow most deep learning libraries and perform a cross-correlation and not a convolution. 
However, since the weights are learnt, this does not matter; the neural network will simply learn a flipped representation of 
the cross-correlation. 
4(b) Recurrent neural networks 


Standard feedforward neural networks like the MLP (§2(b)) and CNN (§4(a)) generate a fixed 
size vector given a fixed size input (footnote 16). But what if we want to classify or generate a 
variably sized vector? For example, we might want to classify a galaxy’s morphology given its rotation 
curve. A rotation curve describes the velocity of a galaxy’s visible stars versus their distance from the 
galaxy’s centre. Fig. 8 shows a possible rotation curve for Messier 81. A rotation curve’s length 
depends on the size of its galaxy, and due to this variable length, and the fact that MLPs take 
a fixed size input, we cannot easily use an MLP for classification. Recurrent neural networks 
(RNNs), however, can take a variable length input and produce a variable length output. An 
RNN differs from a feedforward MLP by having a hidden state that acts as a ‘memory’ store of 
previously seen information. As the RNN encounters new data, its weights are altered through 
the backpropagation through time algorithm [BPTT; 90, and references therein. Also see footnote 
10]. 

Figure 8: An example of a galaxy rotation curve, plotted over an image of Messier 81 [89]. 

We can use an RNN similar to Fig. 9 to classify our rotation curves. We express the rotation 
curve as a list $\{x_1, x_2, \ldots, x_N\}$, with each $x$ being a measurement of the rotational velocity at 
a certain radius. Then we feed this list into the RNN sequentially in the same way as shown 
in Fig. 9. The RNN will produce an output for each $x$ fed to it, but we ignore those until we 
feed in $x_N$, the rotational velocity furthest from the galaxy’s centre. When we feed in $x_N$, the 
RNN produces a prediction $p_N$, which we can then compare to a ground truth $y_N$ via a loss 
function $\mathcal{L}_N$. In our case, $y$ is an integer label representing the galaxy’s morphological class. The 
comparison $\mathcal{L}_N(y_N, p_N)$ is a function that represents the distance between the RNN prediction 
and the ground truth. We can then reduce $\mathcal{L}_N(y_N, p_N)$ by updating the RNN’s weights through 
BPTT so that the weights $\{w_x, w_p, w_h\}$ follow $\nabla\mathcal{L}_N$ downwards. As we do this, our RNN will 
improve its galaxy classifications. 


16 As with any rule there are exceptions, such as CNNs containing a global average pooling layer [88]. 
Figure 9: A recurrent neural network with weights $\{w_x, w_p, w_h\}$, a hidden state $h_n$, inputs $x_n$, 
and a prediction $p_{n=N}$ is unrolled into its constituent processes. 


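BPTT’s mathematical derivation is akin to the one we explored in §2(b), and we will quickly 
derive it here for posterity. Let us first look at the forward propagation equations: 

$$
\begin{aligned}
\mathcal{L}_n &= \lvert y_n - p_n \rvert, \\
p_n &= \varphi(w_p \cdot h_n), \\
h_n &= \varphi(w_h \cdot h_{n-1} + w_x \cdot x_n).
\end{aligned}
$$

From these we see that we need to express $\partial\mathcal{L}_n/\partial w_p$, $\partial\mathcal{L}_n/\partial w_h$, and 
$\partial\mathcal{L}_n/\partial w_x$ as known values to train the network. $\partial\mathcal{L}_n/\partial w_p$ is relatively easy; 
via the chain rule, and the fact that $\partial p_n/\partial w_p = \varphi' h_n^{T}$: 

$$
\begin{aligned}
\frac{\partial \mathcal{L}_n}{\partial w_p} &= \frac{\partial \mathcal{L}_n}{\partial p_n} \frac{\partial p_n}{\partial w_p}, \\
&= \frac{\partial \mathcal{L}_n}{\partial p_n} \odot \varphi' h_n^{T}.
\end{aligned}
\tag{4.2}
$$

$\partial\mathcal{L}_n/\partial w_h$ is more tricky, so we will go step by step. We already know that 

$$
\frac{\partial \mathcal{L}_n}{\partial w_h} = \frac{\partial \mathcal{L}_n}{\partial p_n} \frac{\partial p_n}{\partial h_n} \frac{\partial h_n}{\partial w_h}.
\tag{4.3}
$$

However, we see in Fig. 9 that $h_n$ depends on $h_{n-1}$, which depends on $h_{n-2}$ (and so on). We also 
notice that all the hidden states depend on $w_h$. We therefore rewrite Eq. 4.3 to make this explicit: 

$$
\begin{aligned}
\frac{\partial \mathcal{L}_n}{\partial w_h} &= \frac{\partial \mathcal{L}_n}{\partial p_n} \frac{\partial p_n}{\partial h_n} \sum_{j=1}^{n} \frac{\partial h_n}{\partial h_j} \frac{\partial h_j}{\partial w_h}, \\
&= \frac{\partial \mathcal{L}_n}{\partial p_n} \frac{\partial p_n}{\partial h_n} \sum_{j=1}^{n} \left( \prod_{i=j+1}^{n} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial h_j}{\partial w_h}.
\end{aligned}
$$

We can now substitute in some known values: 

$$
\frac{\partial \mathcal{L}_n}{\partial w_h} = \frac{\partial \mathcal{L}_n}{\partial p_n} \odot \varphi' w_p^{T} \sum_{j=1}^{n} \left( \prod_{i=j+1}^{n} \varphi' w_h^{T} \right) \varphi' h_{j-1}^{T}.
\tag{4.4}
$$

Finally, $\partial\mathcal{L}_n/\partial w_x$ is derived in the same way as $\partial\mathcal{L}_n/\partial w_h$: 

$$
\begin{aligned}
\frac{\partial \mathcal{L}_n}{\partial w_x} &= \frac{\partial \mathcal{L}_n}{\partial p_n} \frac{\partial p_n}{\partial h_n} \sum_{j=1}^{n} \left( \prod_{i=j+1}^{n} \frac{\partial h_i}{\partial h_{i-1}} \right) \frac{\partial h_j}{\partial w_x}, \\
&= \frac{\partial \mathcal{L}_n}{\partial p_n} \odot \varphi' w_p^{T} \sum_{j=1}^{n} \left( \prod_{i=j+1}^{n} \varphi' w_h^{T} \right) \varphi' x_j^{T}.
\end{aligned}
\tag{4.5}
$$

With $\partial\mathcal{L}_n/\partial w_p$, $\partial\mathcal{L}_n/\partial w_h$, and $\partial\mathcal{L}_n/\partial w_x$ in hand we can apply the same 
update rule shown in Eq. 2.5. 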
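
As a sanity check on the derivation above, the sketch below implements Eq. 4.2, Eq. 4.4, and Eq. 4.5 for the 
simplest possible case and compares the results with finite-difference gradients. Several choices here are 
assumptions made purely for illustration: all weights and states are scalars (so the transposes and Hadamard 
products reduce to ordinary multiplication), the activation function is a hyperbolic tangent, the initial 
hidden state is zero, and the input sequence and weight values are arbitrary. 

```python
import numpy as np


def forward(x, w_x, w_h, w_p):
    """Run the RNN over the sequence x; return all hidden states and the final prediction."""
    h = [0.0]  # h_0, the initial hidden state (assumed to be zero)
    for x_n in x:
        h.append(np.tanh(w_h * h[-1] + w_x * x_n))
    p = np.tanh(w_p * h[-1])
    return h, p


def loss(x, y, w_x, w_h, w_p):
    """Absolute difference between the ground truth and the final prediction."""
    _, p = forward(x, w_x, w_h, w_p)
    return abs(y - p)


def bptt_gradients(x, y, w_x, w_h, w_p):
    """Gradients of the final-step loss, following Eq. 4.2, Eq. 4.4, and Eq. 4.5."""
    h, p = forward(x, w_x, w_h, w_p)
    n = len(x)
    dL_dp = np.sign(p - y)                  # derivative of |y - p| with respect to p
    dphi_p = 1.0 - p ** 2                   # tanh'(z) = 1 - tanh(z)^2 at the output
    dphi_h = [1.0 - h_j ** 2 for h_j in h]  # tanh' at each hidden state

    grad_w_p = dL_dp * dphi_p * h[n]        # Eq. 4.2
    upstream = dL_dp * dphi_p * w_p         # dL/dp * dp/dh_n

    sum_h, sum_x = 0.0, 0.0
    for j in range(1, n + 1):
        # Product over i = j+1 ... n of dh_i/dh_{i-1} = phi' * w_h.
        product = 1.0
        for i in range(j + 1, n + 1):
            product *= dphi_h[i] * w_h
        sum_h += product * dphi_h[j] * h[j - 1]  # summand of Eq. 4.4
        sum_x += product * dphi_h[j] * x[j - 1]  # summand of Eq. 4.5
    return upstream * sum_x, upstream * sum_h, grad_w_p


def numerical_gradient(f, w, eps=1e-6):
    """Central finite difference of a scalar function f at w."""
    return (f(w + eps) - f(w - eps)) / (2 * eps)


x = [0.5, -0.1, 0.3, 0.8]  # a toy 'rotation curve'
y = 1.0                    # a toy ground truth
w_x, w_h, w_p = 0.7, -0.4, 1.2

g_x, g_h, g_p = bptt_gradients(x, y, w_x, w_h, w_p)
n_x = numerical_gradient(lambda w: loss(x, y, w, w_h, w_p), w_x)
n_h = numerical_gradient(lambda w: loss(x, y, w_x, w, w_p), w_h)
n_p = numerical_gradient(lambda w: loss(x, y, w_x, w_h, w), w_p)

print("w_x:", g_x, n_x)
print("w_h:", g_h, n_h)
print("w_p:", g_p, n_p)
```

The analytic and numerical gradients should agree to within the finite-difference error, which suggests that 
the sum over hidden states and the nested product are both needed to account for every path from a weight to 
the final loss. 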

Aside from many-to-one encoding, RNNs can produce many predictions given many inputs, 
or act similarly to an MLP and produce one or many outputs given a single input. We will discuss 
the application of recurrent neural networks to astronomical data in §5, after we introduce gated 
recurrent neural networks. 


4(c) Sidestepping the vanishing gradient problem 


In the early 1990s, researchers identified a major issue with the training of deep neural networks 
through backpropagation. Hochreiter first formally examined the ‘vanishing gradient’ problem 
in their diploma thesis (Hochreiter [91], see also later work by Bengio et al. [92]). Due to the 
vanishing gradient problem, it was widely believed that training very deep artificial neural 
networks from scratch via backpropagation was impossible. In this section we will explore what 
the vanishing gradient problem is, and how contemporary end-to-end trained neural networks 
sidestep this issue. 
First let us remind ourselves of the sigmoid activation function introduced in Fig. 6: 

$$
\varphi(x) = \frac{1}{1 + e^{-x}}. \tag{4.6}
$$

Eq. 4.6 and its accompanying plot show the output of a sigmoid function $\varphi$ and its derivative $\varphi'$, 
when given an input $x$. 

Now, let us revisit the weight update rule for the $(L - n)$th layer of a feedforward MLP 
(Eq. 2.4): 

$$
\frac{\partial \mathcal{L}}{\partial w_{L-n}} = \frac{\partial \mathcal{L}}{\partial p} \odot \left( \prod_{i=0}^{n-1} \varphi'_{L-i}\, w_{L-i}^{T} \right) \cdots, 
\qquad \lim_{n \to \infty} \prod_{i=0}^{n-1} \varphi'_{L-i}\, w_{L-i}^{T} = 0. \tag{4.7}
$$

If $\varphi'$ is typically less than one (as in Eq. 4.6 and most other saturating nonlinearities) the product 
term in the above equation becomes an issue. In that case, we can see that the product rapidly 
goes to zero as $n$ (the number of layers) becomes large (footnote 17). If we study Eq. 4.4, we can see the 
same problem also plagues RNNs as we backpropagate through hidden states: 

$$
\frac{\partial \mathcal{L}_n}{\partial w_h} = \frac{\partial \mathcal{L}_n}{\partial p_n} \odot \varphi' w_p^{T} \sum_{j=1}^{n} \left( \prod_{i=j+1}^{n} \varphi' w_h^{T} \right) \varphi' h_{j-1}^{T}, 
\qquad \lim_{n \to \infty} \prod_{i=j+1}^{n} \varphi' w_h^{T} = 0. \tag{4.8}
$$

17 Likewise, if $\varphi'$ is typically greater than one, the product term rapidly ‘explodes’ to infinity. This is known as the 
‘exploding gradient’ problem, also first identified in Hochreiter [91]. 


Let us solidify this issue by reminding ourselves of Eq. 2.5, the weight update rule for a 
network trained through backpropagation: 

$$
w_{\text{next}} = w - \eta \frac{\partial \mathcal{L}}{\partial w}. \tag{4.9}
$$

Combining Eq. 4.9 and the limits defined in Eq. 4.7 and Eq. 4.8 results in the below weight update 
rule in the limit $n \to \infty$: 

$$
\lim_{n \to \infty} w_{\text{next}} = w. \tag{4.10}
$$

Eq. 4.10 shows that learning via backpropagation slows as we move deeper into the network. 
This problem once again caused a loss of faith in the connectionist model, ushering in the second 
AI winter. It took until 2012 for a new boom to begin. In the following three subsections we will 
explore some of the proposed partial solutions to the vanishing gradient problem and show how 
they came together to contribute to the current deep learning boom. 
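
The limit in Eq. 4.10 can be seen numerically with a deliberately simplified toy. In the sketch below we keep 
only the product term of Eq. 4.7, and assume that every layer has a weight of one and a pre-activation of zero, 
which is where the sigmoid derivative takes its maximum value of 0.25. These values, and the learning rate, are 
our own illustrative assumptions. 

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)


def product_term(n_layers, pre_activation=0.0, weight=1.0):
    """Product of (phi' * w) over n_layers identical layers, as in Eq. 4.7."""
    return (sigmoid_derivative(pre_activation) * weight) ** n_layers


learning_rate = 0.1
w = 1.0
for n_layers in (1, 5, 20, 50):
    gradient = product_term(n_layers)     # all other gradient factors set to one
    w_next = w - learning_rate * gradient  # the update rule of Eq. 4.9
    print(n_layers, gradient, w_next)
```

Even in this best case for the sigmoid, the gradient shrinks by a factor of four per layer, and at a depth of 
fifty layers the update is smaller than double precision can resolve, so the updated weight is identical to the 
original weight. 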


4(c.i) Non-saturating activation functions 


We can see in Eq. 4.8 and Eq. 4.7 that if $\varphi' = 1$ then the product term does not automatically 
go to zero or infinity. If this is the case, why not simply design our activation function around 
this property? The rectified linear unit [ReLU; 31,32] is an activation function that does precisely 
this (footnote 18): 

$$
\mathrm{ReLU}(x) = \max(x, 0). \tag{4.11}
$$


The gradient of ReLU is unity if the inputs are above zero, exactly the property we needed to 
mitigate the vanishing gradient problem. Similar non-saturating activation functions also share 
the ReLU gradient’s useful property, see for example the Exponential Linear Unit, Swish, and 
Mish functions in Fig. 6. 
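
A minimal sketch of Eq. 4.11 and its gradient follows. The function names are our own, and setting the 
derivative at exactly zero to zero is a common convention that we adopt here as an assumption. 

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def relu_derivative(x):
    # One where the input is positive, zero elsewhere (including at x = 0 by convention).
    return (x > 0).astype(float)


pre_activations = np.array([-1.0, 0.5, 2.0])
print(relu(pre_activations))
print(relu_derivative(pre_activations))

# Along a path of fifty active (positive input) neurons the derivatives multiply to one,
# in contrast to the sigmoid, whose derivative never exceeds 0.25.
active_path = np.full(50, 0.5)
print(np.prod(relu_derivative(active_path)))
```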


4(c.ii) Graphics processing unit acceleration 


If we can speed up training, we can run an inefficient algorithm (such as backpropagation 
through saturating activations) to completion in less time. One way to speed up training is by 
using hardware that is specifically suited to the training of neural networks. Graphics processing 
units (GPUs) were originally developed to render video games and other intensive graphical 
processing tasks. These rendering tasks require a processor capable of massive parallelism. We 
have seen in the previous sections that neural networks trained through backpropagation also 
require many small weight update calculations. With this in mind, it is natural to try to accelerate 
deep neural networks using GPUs. 

18 ReLU′ is always zero if its inputs are < 0, removing any signal for further training. This is known as the ‘dying ReLU’ 
problem, but is not as big of an issue as it first seems. Since contemporary deep neural networks are greatly overparameterised 
(see for example Frankle and Carbin [93] and other work on the ‘lottery ticket hypothesis’) backpropagation through the 
ReLU activation function can act as a pruning mechanism, creating sparse representations within the neural network and 
thus reducing training time even further [94]. 

In 2004, Oh and Jung [95] were the first to use GPUs to accelerate an MLP model, reporting 
a 20x performance increase on inference with an ATI RADEON 9700 PRO GPU accelerated 
neural network. Shortly after, Steinkraus et al. [96] showed that backpropagation can also 
benefit from GPU acceleration, reporting a three-fold performance increase in both training 
and inference. These two breakthroughs were followed by a flurry of activity in the area [e.g. 
97-100], culminating in a milestone victory for GPU accelerated neural networks at ImageNet 
2012. AlexNet [101] won the ImageNet classification and localisation challenges [102], scoring an 
unprecedented top-5 classification error of 16.4%, and a single object localisation error of 34.2%. 
In both challenges AlexNet scored over 10% better than the models in second place. Krizhevsky 
et al.’s winning network was a CNN [31] trained through backpropagation [36,86], with ReLU 
activation [32], and dropout [103] as a regulariser (footnote 19). The performance increase afforded by GPU 
accelerated training enabled the network to be trained from scratch via backpropagation in a 
reasonable amount of time. The discovery that it is possible to train a neural network from scratch 
by using readily available hardware ultimately resulted in the end of connectionism’s second 
winter, and ushered in the Cambrianesque deep learning explosion of the mid-to-late 2010s and 
the 2020s (Fig. 10). 


4(c.iii) Gated recurrent neural networks and residual networks 


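The long short-term memory unit [LSTM; 105,106] (footnote 20) mitigates the vanishing gradient problem by 
introducing a new hidden state, the ‘cell state’ ($c_n$), to the standard RNN architecture. This cell 
state allows the network to learn long range dependencies, and we will show why this is the case 
via a brief derivation (footnote 21). First, as always, let us study Fig. 11 and write down the forward pass 
equation for updating the cell state: 

$$
c_n = f(c_{n-1}, h_{n-1}, x_n) + g(h_{n-1}, x_n),
$$

where $f(c_{n-1}, h_{n-1}, x_n) = c_{n-1} \odot \varphi(h_{n-1}, x_n)$. For brevity we define 
$\varphi_n = \varphi(h_{n-1}, x_n)$. 

Like the RNN case (Eq. 4.4 and Eq. 4.5), we will need to find $\partial c_n/\partial c_{n-1}$ to calculate 
$\nabla\mathcal{L}$. Therefore, 

$$
\begin{aligned}
\frac{\partial c_n}{\partial c_{n-1}} &= \frac{\partial f(c_{n-1}, h_{n-1}, x_n)}{\partial c_{n-1}} + \frac{\partial g(h_{n-1}, x_n)}{\partial c_{n-1}}, \\
&= \frac{\partial (c_{n-1} \odot \varphi_n)}{\partial c_{n-1}}, \\
&= \frac{\partial c_{n-1}}{\partial c_{n-1}} \odot \varphi_n + c_{n-1} \odot \frac{\partial \varphi_n}{\partial c_{n-1}}, \\
&= \varphi_n.
\end{aligned}
$$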

19 Dropout reduces the amount of neural network overfitting, where a network performs well on the training set at the 
expense of performance on data it has not yet seen. One performs dropout by randomly removing a set of neurons at each 
training step, and using all neurons at test time. This setup essentially trains a large ensemble of sub-models, whose average 
prediction outperforms that inferred by a single model. 

20 Compare also the gated recurrent unit [GRU; 107]. 

21 Here we loosely follow Bayer [108, §1.3.4]. 
Figure 10: If we plot the total number of floating point operations (FLOPs) required to train a 
neural network model, and compare it to the model’s publication date, we can see a change in 
trend at around 2012. This corresponds to the popularisation of GPU accelerated training of very 
deep neural networks, with 2012 demarcating AI’s ‘Deep Learning Era’ and the beginning of 
astronomy’s second wave of connectionism (§5). Data is taken from Sevilla et al. [104]. 


Figure 11: A set of sequential data $x_n$ is input into an LSTM network. Inside the cell, circles denote 
elementwise operations and rectangles denote neuronal layers. $\varphi$ is the sigmoid activation function, 
and Tanh is the hyperbolic tangent activation function. $\oplus$ is an elementwise addition, $\odot$ is the 
Hadamard product, and line mergers are concatenations. $c_n$ is the cell state, and $h_n$ is the hidden 
state. 




Thus, if we want to backpropagate to a cell state deep in the network we must calculate 

$$
\frac{\partial c_n}{\partial c_{n-N}} = \prod_{i=0}^{N-1} \varphi_{n-i}, \qquad n > N. \tag{4.12}
$$

The product term above does not depend on the derivative of a saturating activation function, 
and so does not automatically vanish as $N$ goes to $\infty$. This means that a gradient signal can be 
carried through the LSTM cell state without losing amplitude and vanishing (footnote 22). 
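
The key step of this derivation, that the gradient of the new cell state with respect to the old one is the 
gate value itself, can be checked numerically. The sketch below assumes the standard LSTM cell-state update 
with a sigmoid ‘forget’ gate, a sigmoid ‘input’ gate, and a Tanh candidate layer; the parameter names, layer 
sizes, and random initialisation are our own illustrative choices. As in the derivation above, the previous 
hidden state is held fixed while the previous cell state is perturbed. 

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def cell_state_update(c_prev, h_prev, x, params):
    """One LSTM cell-state update: c_n = c_{n-1} * forget + input_gate * candidate."""
    z = np.concatenate([h_prev, x])                        # line mergers are concatenations
    forget = sigmoid(params["W_f"] @ z + params["b_f"])    # the gate multiplying c_{n-1}
    input_gate = sigmoid(params["W_i"] @ z + params["b_i"])
    candidate = np.tanh(params["W_c"] @ z + params["b_c"])
    c_next = c_prev * forget + input_gate * candidate      # f(...) + g(...)
    return c_next, forget


rng = np.random.default_rng(0)
n_hidden, n_input = 3, 2
params = {name: rng.normal(size=(n_hidden, n_hidden + n_input))
          for name in ("W_f", "W_i", "W_c")}
params.update({name: np.zeros(n_hidden) for name in ("b_f", "b_i", "b_c")})

c_prev = rng.normal(size=n_hidden)
h_prev = rng.normal(size=n_hidden)
x = rng.normal(size=n_input)

c_next, forget = cell_state_update(c_prev, h_prev, x, params)

# Finite-difference estimate of d c_n[k] / d c_{n-1}[k] for each element k.
eps = 1e-6
numerical = np.array([
    (cell_state_update(c_prev + eps * np.eye(n_hidden)[k], h_prev, x, params)[0][k]
     - c_next[k]) / eps
    for k in range(n_hidden)
])

print("gate values:       ", forget)
print("numerical gradient:", numerical)
```

Because the gate is a learnt value between zero and one, rather than the derivative of a saturating 
activation, the network is free to hold it close to one and so carry a gradient across many time steps. 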

We can use a technique derived from the LSTM to solve our vanishing gradient problem 
for deep feedforward neural networks (as studied in §2(b)). Srivastava et al. [111] do this by 
applying the concept of the LSTM’s cell state to their deep convolutional ‘highway network’. 
The highway network uses gated connections to modulate the gradient flow back through 
neuronal layers. Later work by He et al. [112] introduces the residual network (ResNet) by taking 
a highway network and simplifying its connections. They apply an elementwise addition (or 
‘residual connection’) in place of the highway network’s gated connection (Fig. 12a). One can go 
even further with residual connections, as Ronneberger et al. [113] demonstrate with their U-Net 
model. The U-Net combines residual connections with an autoencoder-like architecture (Fig. 12b). 
The U-Net has gone on to become the de facto network for many tasks that require an input and 
output of the same size (such as segmentation, colourisation, and style transfer). 
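
The difference between a gated highway connection and a residual connection can be sketched in a few lines. 
The code below is an illustration under our own assumptions rather than the implementation of either paper: 
the layer is a single ReLU-activated matrix multiplication, the highway gate follows the form 
`gate * layer(x) + (1 - gate) * x`, and all function and variable names are hypothetical. 

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def layer(x, weights):
    """A single neuronal layer with a ReLU activation."""
    return np.maximum(weights @ x, 0.0)


def highway_block(x, weights, gate_weights):
    """Highway connection: a learnt gate mixes the layer output with its input."""
    gate = sigmoid(gate_weights @ x)
    return gate * layer(x, weights) + (1.0 - gate) * x


def residual_block(x, weights):
    """Residual connection: the layer output is added elementwise to its input."""
    return layer(x, weights) + x


x = np.array([0.2, -1.0, 3.0])
zero_weights = np.zeros((3, 3))

# With an untrained (all zero) layer the residual block passes its input through unchanged,
# so there is always an identity path along which a gradient can flow.
print(residual_block(x, zero_weights))

# The highway block instead scales the input by (1 - gate), here 0.5 for all-zero gate weights.
print(highway_block(x, zero_weights, zero_weights))
```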


(a) A single residual connection is applied within a neural network. 
(b) The U-Net, a network that was originally developed to segment biological imagery, uses the residual connection. 


Figure 12: The left subfigure shows the residual connection as originally introduced in He et al. 
[112]. The right subfigure shows an application of the residual connection to an autoencoder-like 
architecture [113], in this case colourising an astronomical object. 


4(d) Translation, attention, and transformers 


Theoretically, gated RNNs (GRNNs) such as the LSTM can learn very long range dependencies 
(see Eq. 4.12 and its accompanying text). In practice, GRNNs tend to forget information about 
distant inputs. This is because the GRNN lacks unmediated access to inputs beyond the 
immediate antecedent as a consequence of its recurrent architecture. The problem is especially 
apparent in neural machine translation tasks that require knowledge of an entire sequence to 
produce an output, such as language to language translation. Fig. 13 shows such a sequence to 
sequence [Seq2Seq; 109] model. Seq2Seq translates between two sets of sequential data by sharing 
a hidden state between two GRNN units. In Fig. 13 we can see that the shared information is 
bottlenecked by the hidden state. Therefore, to resolve the GRNN ‘forgetting problem’ we must 
find a way to avoid any recursion, or serial processing of input and output. We can do this by 
providing the neural network access to all input while it is calculating an output. This was the 
primary motivation behind the transformer architecture [110,114]. 

22 Which is great in theory. In practice, LSTMs still have trouble learning very long range dependencies due to their reliance 
on recurrent processing [109]. Transformer networks [110] are an architecture that uses the concept of attention to address 
this issue. We will discuss transformer networks in §4(d). 




Figure 13: A sequence to sequence [Seq2Seq; 109] model. A sequence x is input into a GRNN. The 
final hidden state (h) of the input network is then passed into a second GRNN. The second GRNN 
then unrolls to predict an output sequence p. Due to the hidden state acting as an intermediary, x 
and p need not be of equal length. 


Modern transformer architectures consist of a series of self-attention layers, often interspersed 
with other layer types (footnote 23). Self-attention as described in Vaswani et al. [110] is shown in Fig. 14. 
Intuitively, it captures the relationships between quanta within a data input. To perform self-attention 
we first take an input sequence 

$$
x = [x_1, x_2, \ldots, x_n],
$$

where $x$ can be any sequence, such as a sentence, a variable star’s time series, or an unravelled 
galaxy image (footnote 24). Here we will follow the literature and refer to $[x_1, \ldots, x_n]$ as tokens. As we 
can see in Fig. 14 the input is passed through a trainable pair of weight matrices $Q$ and $K$. The output 
matrices $q$ and $k^{T}$ are then multiplied together to yield 

$$
(Q \cdot x)(K \cdot x)^{T} = q k^{T} =
\begin{bmatrix}
Q_1 x_1 K_1 x_1 & Q_1 x_1 K_2 x_2 & \cdots & Q_1 x_1 K_n x_n \\
Q_2 x_2 K_1 x_1 & Q_2 x_2 K_2 x_2 & \cdots & Q_2 x_2 K_n x_n \\
\vdots & \vdots & \ddots & \vdots \\
Q_n x_n K_1 x_1 & Q_n x_n K_2 x_2 & \cdots & Q_n x_n K_n x_n
\end{bmatrix}. \tag{4.13}
$$

We can see that Eq. 4.13 describes the relationships between tokens within $x$. For example, if $x_1$ is 
similar semantically to $x_2$, we would expect $Q_1 x_1 K_2 x_2$ and $Q_2 x_2 K_1 x_1$ to have a high value. We 
then normalise $q k^{T}$ to mitigate vanishing gradients (footnote 25), and apply a softmax nonlinearity so that 
the maximum weighting (or similarity) is one. 


23 In the original transformer formulation described in Vaswani et al. [110], the network consisted of a connected ‘encoder’ 
and ‘decoder’ section much like a Seq2Seq model (Fig. 13). Later work has found this to be an unnecessary complication. 
For example, the generative pretrained transformer (GPT) 2 and 3 models [12,115] consist of only decoder layers, and the 
bidirectional encoder representations from transformers (BERT) model consists of only encoder layers [116]. 

24 One can go very general with this, as DeepMind demonstrated with their ‘Gato’ transformer model [117]. Gato can predict 
sequences for myriad tasks, from operating a physical robotic arm, to completing natural language sentences, to playing Atari 
games. 

25 See footnote 17. 
Meanwhile, the input sequence $x$ is passed through the neuronal layer $V$, resulting in a 
weighted representation $v$: 

$$
V \cdot x = v = [V_1 x_1, V_2 x_2, \ldots, V_n x_n].
$$

$v$ is multiplied with the similarity matrix $\varsigma(q k^{T}/\sqrt{n})$. This process weighs similar tokens within 
the sequence higher, increasing their relative importance in later neuronal layers. 
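
The full self-attention operation can be sketched compactly in NumPy. This is a minimal single-head 
illustration under stated assumptions, not a reference implementation: tokens are stored as rows of `x`, the 
weight matrices are random rather than learnt, and we scale the similarity matrix by the square root of the 
key width as in the scaled dot-product attention of Vaswani et al. [110], which plays the role of the 
normalisation described in the text. All array sizes are arbitrary. 

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, Q, K, V):
    """Single-head self-attention over token vectors x with shape (n_tokens, d_model)."""
    q = x @ Q  # queries, shape (n_tokens, d_key)
    k = x @ K  # keys,    shape (n_tokens, d_key)
    v = x @ V  # values,  shape (n_tokens, d_value)
    # Pairwise token similarities, normalised and passed through a softmax.
    similarity = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return similarity @ v, similarity


rng = np.random.default_rng(0)
n_tokens, d_model, d_key = 5, 8, 4
x = rng.normal(size=(n_tokens, d_model))  # e.g. five steps of an embedded time series
Q = rng.normal(size=(d_model, d_key))
K = rng.normal(size=(d_model, d_key))
V = rng.normal(size=(d_model, d_key))

output, similarity = self_attention(x, Q, K, V)
print(similarity.shape, output.shape)
print(similarity.sum(axis=-1))
```

Each row of the similarity matrix sums to one, so every output token is a weighted average of the value 
vectors of all tokens in the sequence, with no recurrence involved. 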


Figure 14: An input ($x$) is fed into a self-attention mechanism. The weights used to produce 
the query ($q$), key ($k$), and value ($v$) matrices are learnt via backpropagation. Here the learnt 
weights are denoted as the capitalised versions of their child matrices. $q$ and $k$ are normalised 
and multiplied together, and a softmax nonlinearity ($\varsigma$) is applied. Finally, $v$ is multiplied with 
the output of the upper path and the final output is fed forward to the next neuronal layer. $\otimes$ denotes 
a matrix multiplication. 


5. Astronomy’s second wave of connectionism 


Compared to classical machine learning techniques (footnote 26), deep learning as outlined in §4 does not 
require an extraction of emergent parameters to train its models. RNNs in particular are well 
suited to observing the full raw information within a time series. Likewise, CNNs are well suited 
to observing raw information within image-based data. Astronomy is rich with both types of data, 
and in this section we will review the history of the application of RNN, CNN, and transformer 
models to astronomical data. 


5(a) Recurrent neural network applications 


RNNs were first applied in astronomy very close to home; Aussem et al. [118] predicted 
atmospheric seeing for observations from ESO’s Very Large Telescope, and the prediction of 
geomagnetic storms given data on the solar wind was also explored in the mid-to-late 1990s and 
early 2000s ([119], [120], and other work from the same group; [121]). 


26 This includes most MLP applications in astronomy, see §3. 

The first use of RNNs for classification in astronomy was carried out in a prescient study 
by Brodrick et al. [122]. They describe the use of an RNN-like Elman network [123]. Their RNN 
was tasked with the search for artificially generated narrowband radio signals that resemble those 
that may be produced by an extraterrestrial civilisation. They found that their model had a test 
set accuracy of 92%, suggesting that RNNs could be a useful tool in the search for extraterrestrial 
intelligence. More than a decade after Brodrick et al. [122], Charnock and Moss [124] used an 
LSTM (Fig. 11) to classify simulated supernovae. They describe two classification problems. One, 
a binary classification between type-Ia and non type-Ia supernovae, and the other a classification 
between supernovae types I, II, and III. For their best performing model they report an accuracy 
of more than 95% for their binary classification problem, and an accuracy of over 90% for their 
trinary classification. This study cemented the usefulness of RNNs for classification problems in 
astronomy. Charnock and Moss [124] was followed by numerous projects studying the use of 
RNNs for classification of time series astronomical data. A non-exhaustive list of modern RNN 
use in astronomy includes: stochastically sampled variable star classification [125]; exoplanet 
instance segmentation [126]; variable star/galaxy sequential imagery classification [127]; and 
gamma ray source classification [128]. We must conclude from these studies that RNNs are 
effective classifiers of astronomical time series, provided that sufficient data is available. 

Of course, recurrent networks are not limited to classification; they can also be used for 
regression problems. First, Weddell and Webb [129] successfully used an echo state network [130] 
to predict the point spread function of a target object in a wide field of view. Capizzi et al. [131] 
used an RNN to inpaint missing NASA Kepler time series data for stellar objects. They found that 
their model could recreate the missing time series to an excellent accuracy, suggesting that the 
RNN could internalise information about the star it was trained on. As in the classification case, 
research into the use of RNNs for regression problems picked up massively in the late 2010s, and 
here we will highlight a selection of these studies that represent the range of RNN use cases. Shen 
et al. [132] used both an LSTM and an autoencoder based RNN to denoise gravitational wave data, 
and Morningstar et al. [133] used a recurrent inference machine to reconstruct gravitationally 
lensed galaxies. Liu et al. [134] used an LSTM to predict solar flare activity. From these studies, 
similarly to the classification case above, we can once again conclude that RNNs are effective 
regressors of astronomical time series. 

RNNs have also been used in cases that are a little more unconventional. For example, 
Kügler et al. [135] used an autoencoding RNN (specifically an echo state network) to extract 
representation embeddings of variable main sequence stars. They find that these embeddings 
capture some emergent properties of these variable stars, such as temperature, and surface 
gravity, suggesting that clustering within the embedding space could result in semantically 
meaningful variable star classification. We will revisit this line of research when we explore 
representation learning within astronomy in detail in §8. An example of more drastic cross- 
pollination between ideas within deep learning and those within astronomy is Smith et al. [136]. 
They use an encoder-decoder network comprising a CNN encoder and an RNN decoder to predict 
surface brightness (SB) profiles of galaxies. This class of neural network was previously used 
extensively within natural language image captioning, and by treating SB profiles as ‘captions’ 
their model was capable of prediction over 100x faster than the previous classical, human-agent 
based method. 


5(b) Convolutional neural network applications 


It did not take long after Krizhevsky et al. [101] established CNNs as the de facto image 
classification network for astronomers to take notice: in 2014 they were applied in the search 
for pulsars [137] as part of an ensemble of methods. Zhu et al. [137] found that their ensemble 
was highly effective, with 100% of their test set pulsar candidates being ranked within the top 
961 of the 90008 test candidates. Shortly after, Hála [138] described the use of one dimensional 
CNNs for a ternary classification problem. They find that their model is capable of classifying 1D 
spectra into quasars, galaxies, and stars to an impressive accuracy. CNNs have also been 
extensively used in galaxy morphological classification. First on the scene was Dieleman et al. 
[139]. They used CNNs to classify galaxy morphology parameters as defined in the Galaxy Zoo 
dataset [140] from galaxy imagery. They observed their galaxies via the SDSS, and found a 99% 
consensus between the Galaxy Zoo labels, and the CNN classifications. Huertas-Company et al. 
[141] showed that Dieleman et al.’s CNN is equally applicable to morphological classification of 
galaxies in the CANDELS fields [142]. Likewise, Aniyan and Thorat [143] showed that CNNs 
are capable of classifying radio galaxies. The combined work of Dieleman et al. [139], Huertas- 
Company et al. [141], and Aniyan and Thorat [143] confirms that CNNs are equally applicable 
to visually dissimilar surveys, with little-to-no modification. Looking a little further afield, Wilde 
et al. [144] used a deep CNN model to classify simulated lensing events. They also applied some 
interpretability techniques to their data, using occlusion mapping [145], gradient class activation 
mapping [146], and Google’s DeepDream to prove that the CNN was indeed classifying via 
observing the gravitational lenses. Alternative CNN models have also been used, such as the 
U-Net (Fig. 12b). The U-Net was initially developed to segment biological imagery [113]. Its 
first use in astronomy was related: Akeret et al. [147] use a U-Net [113] CNN to isolate via 
segmentation, and ultimately remove, radio frequency interference from radio telescope data. 
Likewise, Berger and Stein [148] used a three dimensional U-Net [V-Net; 149] to predict and 
segment out galaxy dark matter haloes in simulations, and Aragon-Calvo [150] used a V-Net 
to segment out the cosmological filaments and walls that make up the large scale structure of 
the Universe. Hausen and Robertson [151] demonstrate that a U-Net is capable of performing 
pixelwise semantic classification of objects in HST/CANDELS imagery, thus proving that U-Nets 
are capable of useful work directly within large imaging surveys, particularly in the deblending 
of overlapping objects, which is a perennial challenge in deep imaging. The U-Net in Lauritsen 
et al. [152] is used to superresolve simulated submillimetre observations. They found that the 
U-Net could successfully do this when using a loss comprising the L1 loss, and a custom loss 
that measures the distance between predicted and ground truth point sources. We must conclude 
from the studies described in this subsection that CNNs are effective classifiers and regressors of 
image-based astronomical data. 


5(c) Transformer applications 


Although initially used for natural language, transformers have also been adapted for use in 
imagery, first by Parmar et al. [153], and also in Dosovitskiy et al. [13]. To our knowledge, 
transformers have not yet been applied to astronomical imagery, but they have started to 
find use in time-series astronomy. Donoso-Oliva et al. [154] used BERT [116] to generate a 
representation space for light curves in a self-supervised manner. Morvan et al. [155] use an 
encoding transformer to denoise light curves from the Transiting Exoplanet Survey Satellite 
[TESS; 156], and show that the denoising surrogate task results in an expressive embedding 
space. Pan et al. [157] also use a transformer model to analyse light curves for exoplanets. 
Transformers have taken the fields of natural language processing and computer vision by storm 
(§9), and so if we extrapolate from trends in other fields we expect to see many more examples of 
transformers applied to astronomical use cases in the near future. We will revisit the transformer 
architecture in the context of foundation models [158, and references therein], and their possible 
future astronomical applications in §9. 


5(d) A problem with supervised learning 


Supervised learning requires a high quality labelled dataset to train a neural network. In turn, 
these datasets require laborious human intervention to create, and so supervised data is in short 
supply. One can avoid this issue by letting the deep learning model infer the data labels itself, and 
project these labels onto a hidden descriptive ‘latent space’. Indeed, all of the networks described 
previously in this review can be repurposed for non-supervised tasks, and in §6 and §7 we will 
explore some deep learning frameworks that do not require supervision. 


6. Deep generative modelling 


In this section we discuss generative modelling within the context of astronomy. Unlike 
discriminative models, generative models explicitly learn the distribution of classes in a dataset 
(Fig. 15). Once we learn the distribution of data, we can use that knowledge to generate new 
synthetic data that resembles that found in the training dataset. In the following subsections we 
will explore in detail three popular forms of deep generative model: the variational autoencoder 
(§6(a)); the generative adversarial network (§6(b)); and the family of score-based (or diffusion) 
models (§6(c)). Finally, in §8 we discuss applications of deep generative modelling in astronomy. 


Figure 15: On the left we see a generative model attempting to learn the probability distributions 
of a dataset that contains a set of galaxies, and a set of stars. On the right is a discriminative model, 
which is attempting to learn the boundary that separates the star and galaxy types. 


6(a) (Variational) autoencoders 


Autoencoders have long been a neural network architectural staple. In a sister paper to 
backpropagation’s populariser, Rumelhart et al. [159] demonstrate backpropagation within 
an autoencoder. Fig. 16 demonstrates the basic neural network autoencoder architecture. An 
autoencoder is tasked with recreating some input data, squeezing the input information (x) into 
a bottleneck latent vector (z) via a neural network q(z|x). z is then expanded to an imitation of 
the input data ($\hat{x}$) by a second neural network $p(\hat{x}|z)$. The standard autoencoder is trained via 
a reconstruction loss, $\mathcal{L}_R(x, \hat{x})$, which measures the difference in pixelspace between x 
and $\hat{x}$. 

Naively, one would think that once trained, one could ‘just’ sample a new latent vector, 
and produce novel imagery via the decoding neural network $p(\hat{x}|z)$. We cannot do this, as 
autoencoders trained purely via a reconstruction loss have no incentive to produce a smoothly 
interpolatable latent space. This means we can use a standard autoencoder to embed and retrieve 
data contained in the training set, but cannot use one to generate new data. To generate new 
data we require a smooth latent space, which variational autoencoders (VAEs; Fig 17) produce by 
design [160]. 

A VAE differs from the standard autoencoder by enforcing a spread in each training set 
sample’s latent vector. We can see in Fig. 17 how this is done; instead of directly predicting z 
the encoder q predicts two vectors, $\mu$ and $\sigma$. z is then sampled stochastically via the equation 


$$z = \mu + \sigma \odot \epsilon, \tag{6.1}$$ 


Figure 16: An autoencoder [159] attends to an image of a black hole. z is a latent vector and x is a 
sample from a training set. The encoder q(z|x) learns to encode the incoming data into a latent vector, 
while the decoder $p(\hat{x}|z)$ takes z as input and attempts to recreate x. 


Figure 17: A variational autoencoder [160] operates on a spiral galaxy. z is a latent vector and 
x is a sample from the training set. The encoder q(z|x) learns to compress the incoming data into a 
latent vector that encodes the normal distribution. The decoder $p(\hat{x}|z)$ takes z as input and attempts to 
recreate x. 


where $\odot$ is the Hadamard product, and $\epsilon$ is noise generated externally to the neural network 
graph²⁷. This spread results in similar samples overlapping within the latent space, and therefore 
we end up with a smooth latent space that we can interpolate through. However, currently there 
is no incentive for the neural network to provide a coherent, compact global structure in the latent 
space. For that we require a regularisation term in the loss. This regularisation is provided via the 
Kullback-Leibler (KL) divergence, which is a measure of the difference between two probability 
distributions. A standard VAE uses the KL divergence to push the latent distribution towards the 
standard normal distribution, incentivising a compact, continuous latent space. Hence, the final 
VAE loss is a combination of the reconstruction loss and KL divergence: 


$$\mathcal{L}_{\mathrm{VAE}} = \mathcal{L}_R(x, \hat{x}) + \mathrm{KL}\left(q(z \mid x) \,\|\, p\right), \tag{6.2}$$ 


where p is some prior. In a standard VAE $p = \mathcal{N}(0, \mathbf{I})$. 
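
The sampling step in Eq. 6.1 and the loss in Eq. 6.2 can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation used in any of the works cited here. It assumes a squared-error reconstruction loss, an encoder that outputs a mean and standard deviation per latent dimension, and the standard normal prior, for which the KL divergence has a well-known closed form. All function names are hypothetical.

```python
import numpy as np


def reparameterise(mu, sigma, rng):
    """Sample z = mu + sigma * eps (Eq. 6.1), with eps drawn outside the network graph."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps  # elementwise (Hadamard) product


def kl_to_standard_normal(mu, sigma):
    """Closed-form KL(N(mu, diag(sigma^2)) || N(0, I)), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma), axis=-1)


def vae_loss(x, x_hat, mu, sigma):
    """Reconstruction loss plus KL regulariser (Eq. 6.2), averaged over the batch."""
    reconstruction = np.sum((x - x_hat) ** 2, axis=-1)  # assumed squared-error L_R
    return np.mean(reconstruction + kl_to_standard_normal(mu, sigma))


rng = np.random.default_rng(0)
mu, sigma = np.zeros((4, 2)), np.ones((4, 2))  # toy encoder outputs for a batch of four
z = reparameterise(mu, sigma, rng)
```

When the encoder output already matches the prior (zero mean, unit standard deviation) the KL term vanishes, so only the reconstruction term contributes to the loss.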


6(b) Generative adversarial networks 


Generative adversarial networks [GAN; 161] can be thought of as a minimax game between 
two competing neural networks. If we anthropomorphise we can gain an intuition for how a 
GAN learns: let us imagine an art forger, and an art critic. The forger wants to paint paintings 
that are similar to famous expensive works, and needs to fool the critic when selling these 


27 To avoid breaking the backpropagation chain the VAE injects noise via an external parameter, $\epsilon$. This is described in Kingma 
and Welling [160] as the ‘reparameterisation trick’. 


paintings. Meanwhile, the critic wants to ensure that no reproductions are sold and so they need to 
accurately determine whether any painting is an original or a reproduction. At first, our forger is 
a poor painter, and so the critic can easily identify our forger’s works. However, the forger learns 
from the critic’s choices and produces more realistic paintings. As the forger’s paintings improve, 
the critic also learns better methods for detecting forgeries. This minimax game incentivises the 
critic to keep improving their classifications, and the forger to keep improving their painting. If 
this continues, we get to a point where the forger’s works are indiscernible from the real thing: 
the forger has learnt to perfectly mimic the dataset! In a GAN, we name the critic the discriminator 
(D), and we name the forger the generator (G). 

In Goodfellow et al.’s original GAN formulation (Fig. 18a), G and D compete during training 
in a minimax game where G aims to maximise the probability of D mispredicting that a generated 
datapoint is sampled from the real dataset. G takes as input a randomly sampled latent vector z, 
and outputs a synthetic datapoint G(z). D takes either this synthetic datapoint, or a real datapoint 
x, and outputs D(G(z)) or D(x). This output is the probability that the datapoint is drawn from 
the real dataset. We compare this probability to a binary label indicating whether the datapoint 
is real or not, and backpropagate the error through both D and G. The network’s weights are 
updated with each training datapoint to follow $\nabla_w \mathcal{L}$ downwards until the distribution of G(z) 
closely resembles that of the real dataset. Once trained, G can be used to generate entirely novel 
synthetic data that closely resembles (but is not identical to) the training set data. 
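
The comparison between the discriminator output and a binary real/fake label can be sketched as a pair of binary cross-entropy losses. The snippet below is an illustrative sketch, not the code of Goodfellow et al. [161]: it assumes labels of 1 for real and 0 for generated data, and for the generator it uses the commonly adopted non-saturating form (the generator is rewarded when D labels its output as real). The function names are hypothetical.

```python
import numpy as np


def binary_cross_entropy(p, label, tiny=1e-12):
    """Mean binary cross entropy between probabilities p and a constant 0/1 label."""
    p = np.clip(p, tiny, 1.0 - tiny)  # avoid log(0)
    return -np.mean(label * np.log(p) + (1.0 - label) * np.log(1.0 - p))


def discriminator_loss(d_real, d_fake):
    """D should output 1 for real datapoints x and 0 for generated datapoints G(z)."""
    return binary_cross_entropy(d_real, 1.0) + binary_cross_entropy(d_fake, 0.0)


def generator_loss(d_fake):
    """G is rewarded when D mistakes G(z) for a real datapoint (non-saturating form)."""
    return binary_cross_entropy(d_fake, 1.0)


# A discriminator that cannot tell real from fake outputs 0.5 for everything.
d_out = np.full(16, 0.5)
loss_d = discriminator_loss(d_out, d_out)
loss_g = generator_loss(d_out)
```

In a full training loop these two losses would be minimised in alternation, with gradients from the generator loss flowing back through D into G.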

In Fig. 18b we see that the GAN adversarial loss can be used to translate between image 
domains [162]. In Isola et al.’s Pix2Pix model, the generator takes as input an image x, and 
attempts to produce a related image y. Meanwhile, the discriminator attempts to discern whether 
the (x, y) pair that it is given is sampled from the training set, or the generator. Otherwise, Pix2Pix 
is trained in the same way as the standard GAN. 


6(c) Score-based generative modelling and diffusion models 


Diffusion models were introduced by Sohl-Dickstein et al. [163] and were first shown to be 
capable of producing high quality synthetic samples by Ho et al. [164]. Diffusion models are 
part of a family of generative deep learning models that employ denoising score matching via 
annealed Langevin dynamic sampling (first explored by Hyvärinen [165], Vincent [166]. More 
recent work can be found in Ho et al. [164], Song and Ermon [167], Jolicoeur-Martineau et al. 
[168, 169], Song et al. [170]). This family of score-based generative models (SBGMs) can generate 
imagery of a quality and diversity surpassing state of the art GAN models [161], a startling result 
considering the historic disparity in interest and development between the two techniques [170- 
173]. SBGMs can super-resolve images [174,175], translate between image domains [176], separate 
superimposed images [177], and in-paint information [170,174]. 

Diffusion models define a diffusion process that projects a complex image domain space 
onto a simple domain space. In the original formulation, this diffusion process is fixed to a 
predefined Markov chain $q(x_t \mid x_{t-1})$ that adds a small amount of Gaussian noise with each step. 
As Fig. 19 shows, this ‘simple domain space’ can be noise sampled from a Gaussian distribution 
$x_T \sim \mathcal{N}(0, \mathbf{I})$. 


6(c.i) Forward process 


To slowly add Gaussian noise to our data we define a Markov chain 


$$q(x_{0 \ldots T}) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1}).$$ 


The amount of noise added per step is controlled with a variance schedule $\{\beta_t \in (0, 1)\}_{t=1}^{T}$: 


$$q(x_t \mid x_{t-1}) = \mathcal{N}\left(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right). \tag{6.3}$$ 


(a) A typical GAN according to Goodfellow et al. [161]. z is a noise vector, x is a sample from the 
training set. The discriminator learns to classify the incoming images as either fake or real, and 
the generator learns to fool the discriminator by producing realistic fakes. 


(b) A Pix2Pix-like model with a U-Net generator [113,162]. The discriminator learns to classify 
the incoming image tuples as either fake or real. Meanwhile, the generator learns to fool the 
discriminator by approximating the colourisation function mapping x → y. 


Figure 18: The GAN and Pix2Pix models. 


Figure 19: It is easy (and achievable without learnt parameters) to add noise to an image, but more 
difficult to remove it. Diffusion models attempt to learn an iterative removal process via training 
an appropriate neural network, $p_\theta(x_{t-1} \mid x_t)$. 


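This process is applied incrementally to the input image. Since we can define the above equation 
such that it only depends on $x_0$ we can immediately calculate an image representation $x_t$ for any 
t [164]. If we define $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$: 

$$\begin{aligned}
x_t &= \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1 - \alpha_t}\, z_{t-1} \\
&= \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{(1 - \alpha_t) + \alpha_t (1 - \alpha_{t-1})}\, \bar{z}_{t-2} \\
&= \sqrt{\alpha_t \alpha_{t-1} \alpha_{t-2}}\, x_{t-3} + \sqrt{(1 - \alpha_t \alpha_{t-1}) + \alpha_t \alpha_{t-1} (1 - \alpha_{t-2})}\, \bar{z}_{t-3} \\
&\;\;\vdots \\
&= \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \bar{z},
\end{aligned} \tag{6.4}$$ 


where $z \sim \mathcal{N}(0, \mathbf{I})$ and $\bar{z}$ is a combination of Gaussians. Plugging the above expression into 
Eq. 6.3 removes the $x_{t-1}$ dependency and yields 


$$q(x_t \mid x_0) = \mathcal{N}\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t) \mathbf{I}\right).$$ 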
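
The closed-form expression above means that a noised image at any step t can be drawn directly from the clean image, without iterating through the chain. The sketch below illustrates this under stated assumptions: the linear variance schedule and its end points (1e-4 to 0.02 over 1000 steps) are one common choice, and the function names are hypothetical.

```python
import numpy as np


def make_schedule(T=1000, beta_min=1e-4, beta_max=0.02):
    """Assumed linear variance schedule, with alpha_t and its cumulative product."""
    beta = np.linspace(beta_min, beta_max, T)
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)  # alpha_bar[t - 1] is the product of alpha_1 ... alpha_t
    return beta, alpha, alpha_bar


def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) in one shot for a 1-indexed step t, returning the noise used."""
    z = rng.standard_normal(x0.shape)
    a_bar = alpha_bar[t - 1]
    return np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * z, z


beta, alpha, alpha_bar = make_schedule()
x_t, z = q_sample(np.ones((8, 8)), 500, alpha_bar, np.random.default_rng(0))
```

Because the per-step noise variances accumulate as var_t = alpha_t var_(t-1) + beta_t, the total variance after t steps is exactly 1 - alpha_bar_t, which is why the single draw above is equivalent to applying Eq. 6.3 t times.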


6(c.ii) Reverse process 


Diffusion models attempt to reverse the forward process by applying a Markov chain with learnt 
Gaussian transitions. These transitions can be learnt via an appropriate neural network, $p_\theta$: 


$$p_\theta(x_{0 \ldots T}) = p(x_T) \prod_{t=1}^{T} p_\theta(x_{t-1} \mid x_t),$$ 


$$p_\theta(x_{t-1} \mid x_t) = \mathcal{N}\left(x_{t-1};\, \mu_\theta(x_t, t),\, \Sigma_\theta(x_t, t)\right).$$ 


While $\Sigma_\theta(x_t, t)$ can be learnt²⁸, the Ho et al. [164] formulation fixes $\Sigma_\theta$ to an iteration-dependent 
constant $\sigma_t^2 \mathbf{I}$, where $\sigma_t^2 = 1 - \alpha_t$. 

By recognising that diffusion models are a restricted class of hierarchical VAE²⁹, we see that 
we can train $p_\theta$ by optimising the evidence lower bound [ELBO, introduced in 160] that can be 
written as a summation over the Kullback-Leibler divergences at each iteration step³⁰: 


$$L_{\mathrm{ELBO}} = \mathbb{E}_q \left[ D_{\mathrm{KL}}\left(q(x_T \mid x_0) \,\|\, p(x_T)\right) + \sum_{t>1} D_{\mathrm{KL}}\left(q(x_{t-1} \mid x_t, x_0) \,\|\, p_\theta(x_{t-1} \mid x_t)\right) - \log p_\theta(x_0 \mid x_1) \right]. \tag{6.5}$$ 
i51 


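In the Ho et al. [164] formulation, the first term in Eq. 6.5 is a constant during training and the 
final term is modelled as an independent discrete decoder. This leaves the middle summation. 
Each summand can be written as 


$$L(\mu_t, \mu_\theta) = \frac{1}{2\sigma_t^2} \left\lVert \mu_t(x_t, x_0) - \mu_\theta(x_t, t) \right\rVert^2, \tag{6.6}$$ 


where $\mu_\theta$ is the neural network’s estimation of the forward process posterior mean $\mu_t$. In practice 
it would be preferable to predict the noise addition in each iteration step ($z_t$), as $z_t$ has a 
distribution that by definition is centred about zero, with a well defined variance. To this end 
we can define $\mu_\theta$ as 


$$\mu_\theta(x_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, z_\theta(x_t, t) \right), \tag{6.7}$$ 


and by combining Eqs. 6.6 and 6.7 we get 


$$\begin{aligned}
L(z_t, z_\theta) &= \frac{1}{2\sigma_t^2} \left\lVert \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, z_t \right) - \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, z_\theta(x_t, t) \right) \right\rVert^2 \\
&= \frac{(1 - \alpha_t)^2}{2\sigma_t^2 \alpha_t (1 - \bar{\alpha}_t)} \left\lVert z_t - z_\theta(x_t, t) \right\rVert^2.
\end{aligned} \tag{6.8}$$ 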


Ho et al. [164] empirically found that a simplified version of the loss described in Eq. 6.8 results 
in better sample quality. They therefore drop the weighting prefactor in Eq. 6.8, and optimise to 
predict the noise required to reverse a forward process iteration step: 


$$L(z_t, z_\theta) = \left\lVert z_t - z_\theta(x_t, t) \right\rVert^2, \quad \text{where } x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, z_t. \tag{6.9}$$ 


By recognising that $z_t = \sigma_t^2 \nabla_{x_t} \log q(x_t \mid x_{t-1})$, we see that Eq. 6.9 is equivalent to denoising score 
matching over t noise levels [166]. This connection establishes a link between diffusion models 
and other SBGMs [such as 167,168,180]. 
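
A single stochastic evaluation of the simplified loss in Eq. 6.9 can be sketched as follows. This is an illustration under assumptions rather than the training code of Ho et al. [164]: the step t is assumed to be drawn uniformly, the variance schedule is an assumed linear one, and `z_theta` is a hypothetical callable standing in for the noise-predicting neural network.

```python
import numpy as np


def simplified_loss(z_theta, x0, alpha_bar, rng):
    """One stochastic evaluation of Eq. 6.9 for a single training sample x0."""
    T = len(alpha_bar)
    t = int(rng.integers(1, T + 1))      # uniformly chosen step in 1..T (upper bound exclusive)
    z_t = rng.standard_normal(x0.shape)  # the noise the network must predict
    a_bar = alpha_bar[t - 1]
    x_t = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * z_t
    return np.sum((z_t - z_theta(x_t, t)) ** 2)


alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
x0 = np.zeros((8, 8))                                        # stand-in for a training image


def zero_predictor(x_t, t):
    """Hypothetical untrained network that always predicts zero noise."""
    return np.zeros_like(x_t)


loss = simplified_loss(zero_predictor, x0, alpha_bar, np.random.default_rng(0))
```

In practice this quantity would be averaged over a minibatch and minimised by gradient descent on the parameters of the noise predictor.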


28 See for example Nichol and Dhariwal [171]. 
29 Denoising autoencoders (§6(a)) have an interesting relationship with score-based generative (or diffusion) models. As a 
taster, Turner [178] reframes diffusion models as a class of hierarchical denoising VAE, and Dieleman [179] shows through a 
brief derivation that diffusion models optimise the same loss as a denoising autoencoder. 
30 See Appendix B in Sohl-Dickstein et al. [163] and Appendix A in Ho et al. [164] for the full derivation. 


To run inference for the reverse process, one progressively removes the predicted noise $z_\theta$ from 
an image. The predicted noise is weighted according to a variance schedule: 


$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( x_t - \frac{1 - \alpha_t}{\sqrt{1 - \bar{\alpha}_t}}\, z_\theta(x_t, t) \right) + \sigma_t z.$$ 


If we take $p(x_T) = \mathcal{N}(x_T; 0, \mathbf{I})$, we can use $p_\theta$ to generate entirely novel data that are similar, 
but not identical to, those found in the training set. 
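
The iterative noise removal described above can be sketched as a simple loop. The snippet is a minimal illustration, not a reference implementation: it assumes the fixed variance $\sigma_t^2 = 1 - \alpha_t$, an assumed linear schedule, and the common convention of adding no fresh noise at the final step. `z_theta` is a hypothetical callable standing in for the trained network; for demonstration it is replaced by an oracle for a toy dataset that contains a single image.

```python
import numpy as np


def sample(z_theta, shape, beta, rng):
    """Reverse the forward process step by step, starting from pure Gaussian noise."""
    alpha = 1.0 - beta
    alpha_bar = np.cumprod(alpha)
    x = rng.standard_normal(shape)              # x_T ~ N(0, I)
    for t in range(len(beta), 0, -1):           # t = T, ..., 1
        a, a_bar = alpha[t - 1], alpha_bar[t - 1]
        sigma = np.sqrt(beta[t - 1])            # sigma_t^2 = 1 - alpha_t
        noise = rng.standard_normal(shape) if t > 1 else np.zeros(shape)
        x = (x - (1.0 - a) / np.sqrt(1.0 - a_bar) * z_theta(x, t)) / np.sqrt(a) + sigma * noise
    return x


beta = np.linspace(1e-4, 0.02, 200)             # assumed linear schedule with T = 200
alpha_bar = np.cumprod(1.0 - beta)
target = np.full((4, 4), 3.0)                   # the only "image" in the toy dataset


def oracle(x, t):
    """Exact noise for the toy dataset, standing in for a trained network."""
    return (x - np.sqrt(alpha_bar[t - 1]) * target) / np.sqrt(1.0 - alpha_bar[t - 1])


x0 = sample(oracle, target.shape, beta, np.random.default_rng(0))
```

With the oracle in place the loop recovers the single training image, which is a useful sanity check before swapping in a learnt noise predictor.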


6(c.iii) Denoising diffusion implicit models 


Ho et al.’s diffusion model performs inference at a rate orders of magnitude slower than single-shot 
generative models like the VAE (§6(a)) or the GAN (§6(b)). This is because diffusion models 
need to sequentially reverse every step in the forward process Markov chain. Reducing the 
inference time for diffusion models is an active area of research [169,181,182], and here we will 
review one proposed solution to the problem: the denoising diffusion implicit model [DDIM; 183]. 
Song et al. [183] propose the following decomposition of Eq. 6.4: 


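$$\begin{aligned}
x_{t-1} &= \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\, z_\theta(x_t, t) \\
&= \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, z_\theta + \sigma_t z \\
&= \underbrace{\sqrt{\bar{\alpha}_{t-1}} \left( \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\, z_\theta}{\sqrt{\bar{\alpha}_t}} \right)}_{x_0 \text{ prediction}} + \underbrace{\sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\, z_\theta}_{\text{vector towards } x_t} + \underbrace{\sigma_t z}_{\text{noise}}.
\end{aligned}$$ 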


Intuitively, the first term can be thought of as the prediction of the input image $x_0$, given an 
iteration step t. The second term can be thought of as a vector from $x_{t-1}$ towards the current 
iteration step image $x_t$. The third term is random noise. If we substitute in $x_t$ from Eq. 6.9 we 
make this intuition explicit: 


$$x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2}\; \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sqrt{1 - \bar{\alpha}_t}} + \sigma_t z_t.$$ 


If we then set $\sigma_t = 0$, we remove the noise dependency and the forward process becomes 
deterministic: 


$$q_{\mathrm{DDIM}}(x_{t-1} \mid x_t, x_0) = \sqrt{\bar{\alpha}_{t-1}}\, x_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\; \frac{x_t - \sqrt{\bar{\alpha}_t}\, x_0}{\sqrt{1 - \bar{\alpha}_t}}. \tag{6.10}$$ 


This means that DDIMs can deterministically map to and from the latent space, and so inherit 
all the benefits of this property. For example, two objects sampled from similar latent vectors 
share high level properties, latent space arithmetic is possible, and we can perform meaningful 
interpolation within this space. We demonstrate DDIM latent space interpolation in Fig. 20. 


Figure 20: Meaningful latent space interpolation via a DDIM model [183,184]. This property 
comes ‘for free’ with most other generative models, however the Denoising Diffusion Probabilistic 
Model [164] requires a tweak to its sampling scheme (Eq. 6.10). 


We can also subsample every $\tau$ number of steps at inference time, where $\tau$ is a set of evenly 
spaced steps between 0 and T, the maximum number of steps in the forward process: 


$$q_{\mathrm{DDIM}}(x_{\tau_{i-1}} \mid x_{\tau_i}, x_0) = \sqrt{\bar{\alpha}_{\tau_{i-1}}}\, x_0 + \sqrt{1 - \bar{\alpha}_{\tau_{i-1}}}\; \frac{x_{\tau_i} - \sqrt{\bar{\alpha}_{\tau_i}}\, x_0}{\sqrt{1 - \bar{\alpha}_{\tau_i}}}. \tag{6.11}$$ 


As shown in Song et al. [183] this results in acceptable generations with a $T/\tau$ inference speed up. 
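
Deterministic DDIM sampling over a subsampled set of steps can be sketched as below. This is a minimal illustration under assumptions, not the implementation of Song et al. [183]: $x_0$ in Eqs. 6.10 and 6.11 is replaced by the network's current prediction of it, the cumulative product at step 0 is taken to be 1, and the linear schedule is an assumed choice. `z_theta` is again a hypothetical noise predictor, replaced here by an oracle for a single-image toy dataset.

```python
import numpy as np


def ddim_sample(z_theta, x_T, alpha_bar, n_steps):
    """Deterministic DDIM sampling (sigma_t = 0) over n_steps evenly spaced steps."""
    T = len(alpha_bar)
    tau = np.linspace(T, 0, n_steps + 1).round().astype(int)  # T, ..., 0
    x = x_T
    for t, t_prev in zip(tau[:-1], tau[1:]):
        a_bar = alpha_bar[t - 1]
        a_bar_prev = alpha_bar[t_prev - 1] if t_prev > 0 else 1.0  # convention: alpha_bar_0 = 1
        z = z_theta(x, t)
        x0_pred = (x - np.sqrt(1.0 - a_bar) * z) / np.sqrt(a_bar)  # current estimate of x_0
        x = np.sqrt(a_bar_prev) * x0_pred + np.sqrt(1.0 - a_bar_prev) * z
    return x


alpha_bar = np.cumprod(1.0 - np.linspace(1e-4, 0.02, 1000))  # assumed linear schedule
target = np.full((4, 4), 3.0)                                # single-image toy dataset


def oracle(x, t):
    """Exact noise for the toy dataset, standing in for a trained network."""
    return (x - np.sqrt(alpha_bar[t - 1]) * target) / np.sqrt(1.0 - alpha_bar[t - 1])


x_T = np.random.default_rng(0).standard_normal(target.shape)
x0 = ddim_sample(oracle, x_T, alpha_bar, n_steps=20)         # 20 network calls instead of 1000
```

Because no noise is injected, the same starting latent always maps to the same output, which is the property that makes the latent space interpolation described above possible.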


7. Representation learning 


Self-supervised representation learning has recently exploded in popularity, with a slew of 
models being developed in rapid succession [e.g. 185-189]. At its core, representation learning 
attempts to produce semantically meaningful compressed representations (or embeddings) 
of complex high-dimensional data. Aside from simply being a compression device, these 
embeddings can also be taken and used in downstream tasks, like clustering, anomaly detection, 
or classification. 

In this section we will describe two approaches to representation learning that are popular 
within astronomy. The first approach uses contrastive learning as defined by the SimCLR model 
[185,186]. The second approach defines and uses a ‘surrogate task’ (such as autoencoding or 
next value prediction) to train a deep learning model, and extracts semantically meaningful 
representations from the subsequent trained network. 


7(a) Contrastive learning 


Fig. 21 describes a simple contrastive learning model similar to SimCLR [185]. This model takes 
as input a sample x from the training set, and augments it to produce A(x). This augmentation is 
performed in such a way that A(x) shares enough semantically meaningful data with x to belong 
to the same class. In the contrastive learning literature (x, A(x)) is known as a positive pair. This 
positive pair is passed to a Siamese neural network $\Phi$, which projects the high dimensional input 
data onto a lower dimensional ‘embedding space’. All other training set samples are assumed to 
belong to a different class to x, and so can be combined with x to produce ‘negative pairs’. Once 
we produce some embeddings we need to define a loss that clusters similar samples together, 
while simultaneously pushing away dissimilar samples. Hadsell et al. [190] propose such a loss, 
the maximum margin contrastive loss: 


$$\mathcal{L}(z_i, z_j) = \delta_{ij}\, \lVert z_i - z_j \rVert + (1 - \delta_{ij}) \max\left(0,\, m - \lVert z_i - z_j \rVert\right),$$ 


where $\delta$ is the Kronecker delta, $z_i$ and $z_j$ are embedding vectors³¹, $\lVert z_i - z_j \rVert$ is the distance between them, and m is the margin. If $z_i$ and 
$z_j$ are a positive pair, the loss pulls the embeddings closer, and if they are a negative pair the loss 
pushes the embeddings away from each other. The margin imposes an upper distance bound on 
dissimilar embeddings. 
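
The behaviour of the maximum margin contrastive loss can be sketched for a single pair of embeddings. The snippet assumes a Euclidean distance between embeddings and a margin of one, both of which are illustrative choices, and the function name is hypothetical.

```python
import numpy as np


def max_margin_contrastive_loss(z_i, z_j, positive_pair, margin=1.0):
    """Pull positive pairs together; push negative pairs apart, up to the margin."""
    distance = np.linalg.norm(z_i - z_j)  # assumed Euclidean distance
    if positive_pair:
        return distance
    return max(0.0, margin - distance)


anchor = np.array([1.0, 0.0])
loss_close_negative = max_margin_contrastive_loss(anchor, np.array([1.0, 0.0]), False)
loss_far_negative = max_margin_contrastive_loss(anchor, np.array([-1.0, 0.0]), False)
```

A negative pair that already sits further apart than the margin contributes nothing to the loss, so the optimisation concentrates on negatives that are still too close.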

While useful, the maximum margin contrastive loss does not take into account the embedding 
space beyond the pair it is attending to in each training step. This limitation ultimately results in 
a less expressive embedding space. The triplet loss [191] solves this issue by taking into account 
the broader embedding space and simultaneously attracting a positive pair while repulsing a 
negative pair with each training step: 


$$\mathcal{L}(z_i, z_j, z_k) = \max\left(0,\, \lVert z_i - z_j \rVert - \lVert z_i - z_k \rVert + m\right), \tag{7.1}$$ 


where $z_k$ is sampled from a different class to $z_i$, and $z_j$ is sampled from the same class as $z_i$. 
If we study Eq. 7.1 we see that it is possible to generalise our loss even further, taking into 
account an arbitrary number of negative samples. The normalized temperature-scaled cross 


31 All embeddings in this subsection are normalised. 


(a) Possible application to imagery. (b) Possible application to sequential data. 


Figure 21: A simple contrastive learning model is applied to both imagery and sequential data. A 
is an augmentation pipeline. For imagery, A could consist of random crops, noise addition, and 
colour jitter. For sequential data, A could consist of noise addition, stochastic temporal shifting, 
and random data deletion. $\Phi$ is a function approximator that projects inputs onto an embedding 
space. $\Phi$ is typically a neural network: when processing imagery, $\Phi$ could take the form of a CNN, 
and when processing sequential data $\Phi$ could be an RNN. The loss $\mathcal{L}$ measures the distance 
between the embeddings $\Phi(x) = z_i$ and $\Phi(A(x)) = z_j$, and we train by attempting to minimise 
this distance while maximising the distance between dissimilar samples. 


entropy loss [NT-Xent; 192] does precisely this: 


$$\mathcal{L}(z_i, z_j) = -\log\left( \frac{\exp(z_i^{\top} z_j / T)}{\sum_{k} (1 - \delta_{ki}) \exp(z_i^{\top} z_k / T)} \right), \tag{7.2}$$ 


where $z_i$ and $z_j$ are a positive embedding pair, and $z_i$ and $z_k$ are a negative pair. T is a 
‘temperature’ hyperparameter introduced in Chen et al. [185] to help the model learn from hard 
negatives (negatives closer to the anchor than the comparison positive, see Fig. 22b). 
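
The triplet and NT-Xent losses can be sketched side by side for a small set of embeddings. This is an illustrative sketch under assumptions: the triplet loss uses Euclidean distances, NT-Xent uses dot products of normalised embeddings, and the margin and temperature values are arbitrary defaults. The function names are hypothetical.

```python
import numpy as np


def triplet_loss(z_i, z_j, z_k, margin=1.0):
    """Attract the positive z_j towards the anchor z_i while repelling the negative z_k."""
    d_positive = np.linalg.norm(z_i - z_j)
    d_negative = np.linalg.norm(z_i - z_k)
    return max(0.0, d_positive - d_negative + margin)


def nt_xent_loss(Z, i, j, temperature=0.5):
    """NT-Xent for anchor row i and positive row j; all other rows act as negatives."""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)   # normalise the embeddings
    similarities = Z @ Z[i] / temperature
    keep = np.arange(len(Z)) != i                      # the (1 - delta_ki) mask
    largest = similarities[keep].max()                 # stabilised log-sum-exp
    log_denominator = largest + np.log(np.sum(np.exp(similarities[keep] - largest)))
    return -(similarities[j] - log_denominator)


Z = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])     # rows 0 and 1 coincide
loss_aligned = nt_xent_loss(Z, i=0, j=1)               # positive sits on top of the anchor
loss_misaligned = nt_xent_loss(Z, i=0, j=2)            # positive is orthogonal to the anchor
```

The aligned positive yields the smaller loss of the two, and lowering the temperature sharpens the softmax so that negatives close to the anchor dominate the denominator.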


7(b) Learning representations via a surrogate task 


One can also learn representations via a surrogate task. A surrogate task is any task that is 
unrelated to the network’s final use. However, in the process of learning to perform the surrogate 
task, the network learns what is important, and what is unimportant about data within the 
training set. This information can then be extracted in the form of learnt representations. If the 
surrogate task is general enough, these representations will contain useful semantic information 
about the items in the dataset, and can then be used for downstream applications (e.g. clustering, 
classification, anomaly detection). 

Let us concretise this process by revisiting an example that we previously discussed in §4(b). 
Let us imagine we have a large set of galaxy rotation curves that we want to extract embeddings 
from. We could train an LSTM model (Fig. 23) on the task of predicting the next item in the 
rotation curve, with the model only having access to the previous items in the profile. Once the 
LSTM model is trained on this task, we can feed in a full, new rotation curve, and repurpose 
the final hidden state as a representative embedding. Note that this set up does not rely on any 
external labels, only on the rotation curve itself³². 

We can generate embeddings via an autoencoding task. Again, let us use an astronomical 
example to specify this, and say that we want to extract embeddings from a set of galaxy 


32 This self-supervised training set up is similar to that used to train autoregressive foundation models. These models will be 
explored in detail in §9(a). 


(a) The triplet (Eq. 7.1) and NT-Xent (Eq. 7.2) losses simultaneously incentivise attraction 
between embeddings sampled from the same class ($z_i$ and $z_j$), and repulsion between 
embeddings sampled from different classes ($z_i$ and $z_k$). 


(b) Types of negative embeddings. $z_i$ and $z_j$ form a positive embedding pair. If a negative is closer 
than the current positive it is considered a hard negative, if it lies within the margin it is considered 
a semi-hard negative, and if it is beyond the margin it is considered an easy negative. 


Figure 22: More information on self-supervised embeddings. Fig. 22a depicts the inner workings 
of the triplet and NT-Xent losses, and Fig. 22b shows the three possible negative embedding types 
as described in the literature. 


observations. We could repurpose a variational autoencoder for this, training it as normal as 
described in §6(a). However, once the model is trained we would discard the decoding part of 
the network, and only consider the encoder. To generate embeddings we would then simply pass 
in our galaxy images to the trained encoder. The same process can be carried out by a GAN (§6(b)). 
In the GAN case, we would discard the generator after training, and use the discriminator’s 
penultimate layer outputs as our embeddings. 

Supervised networks can also be used to generate embeddings. If a network has been trained 
in a supervised manner to classify or regress data, it will have learnt some properties about that 
data that helps it to carry out its task. We can access these learnt representations by taking the 
outputs from a trained network’s penultimate layer as an embedding³³. 
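
Extracting an embedding from the penultimate layer amounts to returning the last hidden activations alongside the network output. The sketch below uses a small fully connected network with random weights as a stand-in for a trained classifier. The layer sizes, weights, and function name are all hypothetical.

```python
import numpy as np


def mlp_forward(x, weights, biases):
    """Forward pass returning (output, penultimate-layer activations)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, h @ W + b)  # ReLU hidden layers
    output = h @ weights[-1] + biases[-1]
    return output, h


rng = np.random.default_rng(0)
sizes = [16, 8, 4, 3]  # input, two hidden layers, three output classes
weights = [rng.standard_normal((m, n)) for m, n in zip(sizes[:-1], sizes[1:])]  # stand-in weights
biases = [np.zeros(n) for n in sizes[1:]]
logits, embedding = mlp_forward(rng.standard_normal((5, 16)), weights, biases)
```

The `embedding` array holds one four-dimensional vector per input, which could then be passed to a clustering or anomaly detection routine.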


8. Astronomy’s third wave of connectionism 


Since its astronomical debut in the mid-2010s [196]³⁴, deep generative modelling has become a 
popular subfield within astronomical connectionism. This popularity is driven by its inherent 
scalability; the lack of a need for labelled data allows the methods to be repurposed for any 
dataset that might be at hand. Self-supervised connectionism has been around for longer [e.g. 
198], but again has recently exploded in popularity due to its usefulness in wrangling enormous 
unlabelled datasets. This section is split into two major parts. We will first outline the history of 
deep astronomical generative modelling in §8(a), and the history of astronomical representation 
learning will be discussed in §8(b). Although representation learning is the explicit goal for only 


33 Interestingly, this process is used in the calculation for the Fréchet Inception Distance (FID) [193,194]. The FID acts as a 
measurement of the visual similarity between two datasets. The FID works by taking the penultimate layer representations 
from a trained Inception-v3 model [195] for each dataset and calculating the distance between them. 
34 Also compare its companion paper [197]. 


(a) While training we feed in the galaxy rotation curve, and predict the next observation in its 
sequence. 


(b) While inferring we feed in the full galaxy rotation curve, and extract the LSTM hidden state 
as a compressed representation embedding of the curve. Otherwise, we ignore whatever output 
(i.e. $\{p_1, \ldots, p_N\}$) the LSTM generates. 


Figure 23: A hypothetical surrogate task for extracting rotation curve representations is shown. 
$\{x_0, \ldots, x_N\}$ is a set of observations from a galaxy rotation curve, in order of radial distance from 
the galactic centre. $\{p_1, \ldots, p_N\}$ is the LSTM’s corresponding set of predictions for $\{x_1, \ldots, x_N\}$. 
h is the LSTM hidden state vector. See Fig. 11 for more about the internal workings of the LSTM. 


the studies described in §8(b), it must be stressed that representations can also be extracted from all 
the deep generative models described in §8(a). 


8(a) Deep astronomical generative modelling 


Seminally, Regier et al. [196] proposed the use of a VAE to model galaxy observations. They 
trained their network on downscaled 69 x 69 crops of galaxies from a SDSS-sampled dataset 
containing 43444 galaxies. They trained their network in the same way as described in §6(a), 
and find that the network is capable of generating galaxies similar to those found in the training 
set. They also find that their network produces semantically meaningful embeddings, noting that 
their galaxies are clustered by orientation and morphological type. This same line of enquiry 
was followed by Ravanbakhsh et al. [199], who showed that VAEs could be used to generate 
galaxies conditionally. Ravanbakhsh et al. [199] also pioneered the use of GANs to generate 
galaxy imagery. Spindler et al. [200] used a VAE combined with a Gaussian mixture model prior 
(see Eq. 6.2 and accompanying text) to generate and cluster galaxy images into morphological 
types. While the previous studies in this paragraph used images with relatively small pixel 
dimensions in their training set, Fussell and Moews [201] and Holzschuh et al. [202] demonstrated 
that GANs are capable of generating large high fidelity galaxy observations. Fussell and Moews 
[201] achieved this with a stacked GAN architecture [203], and Holzschuh et al. [202] use the 
related StyleGAN architecture [204] to the same end. Bretonnière et al. [205] use a flow-based 
model³⁵ [207,208] to conditionally simulate galaxy observations. They found that their approach 
could produce more accurate simulations than the previous analytical approach, at the cost of 
inference time. Relatedly, Smith et al. [184] use a diffusion model to generate large high fidelity 
galaxies. They trained their network on two datasets composed of galaxies as observed by the 
Dark Energy Spectroscopic Instrument [DESI; 209]. One is a set of 306 006 galaxies catalogued in 
the SDSS Data Release 7 [74,210,211], and the other is a set of 1962 late-type galaxies, as catalogued 
in the Photometry and Rotation curve OBservations from Extragalactic Surveys [PROBES; 212] 
dataset. PROBES contains well resolved galaxies that exhibit spiral arms, bars, and other features 
characteristic of late-type galaxies. They found that their model produces galaxies that are 
both qualitatively and statistically indistinguishable from those in the training set, proving that 
diffusion models are a competitive alternative to the more established GAN and VAE models for 
astronomical simulation. From all of these studies we can conclude that deep generative models 
can internalise a model capable of physically and morphologically describing galaxies. 

Generative models have also been used to simulate astronomical data on larger scales. In 
a use-case tangential to galaxy generation, Smith and Geach [213] show that a Spatial-GAN 
[214] can simulate arbitrarily wide field surveys. They train on the Hubble eXtreme Deep 
Field, and find that galaxies ‘detected’ within their model’s synthetic deep fields are statistically 
indistinguishable from the real thing. Cosmological simulations have also been explored, with 
Rodriguez et al. [215] using a GAN to generate cosmic web simulations at pace, and Mustafa et al. 
[216] generating weak lensing convergence maps at a pace faster than classic simulations. Beyond 
GANs, Remy et al. [217]³⁶ trained a SBGM on simulated maps from MassiveNus [219], and found 
that their model was capable of replicating these maps. They also demonstrated that their model 
was capable of producing a likely spread in the posterior predictions. Finally, they demonstrate 
that a SBGM is capable of predicting the mass map of the real Hubble Cosmic Evolution Survey 
(COSMOS) field [220]. 

The image domain translation abilities of GANs in a Pix2Pix-like formulation [162, also see 
Fig. 18b] are particularly useful in astronomy. Schawinski et al. [221] demonstrated this use first 
by training a Pix2Pix-like model to denoise astronomical data. They trained their network on 
4550 galaxies sampled from SDSS. The galaxies were convolved to increase the seeing, and 
speckle noise was added. The GAN was tasked with reversing this process. They found that 


35 Flow-based models have not been discussed in detail in this review, but see Weng [206] for a magisterial introduction to the 
subject. 
36 This preliminary work has been subsequently extended in Remy et al. [218]. 


their method outperformed both blind deconvolution, and Lucy-Richardson deconvolution. 
Generative models are also capable of separating sources, as Stark et al. [222] demonstrate by 
using a Pix2Pix model to deblend a quasar’s point source emission from the extended light of 
its host galaxy. Reiman and Gohre [223] use a similar model to Stark et al. [222] to deblend 
overlapping galaxies. 

At the time of writing there are only three examples of score-based (or diffusion) modelling in 
the astronomy literature [184,217,218]. It is surprising that these studies are the only examples of 
score-based modelling in astronomy, as SBGMs produce generations that rival those of state-of-the-art 
GAN models, without the drawbacks present in other models (like blurring in the case of VAEs, or 
mode collapse and training instability in the case of GANs). SBGMs also have some natural uses 
in astronomical data pipelines. For example, an implementation similar to Sasaki et al. [176] could 
be used for survey-to-survey photometry translation similarly to Buncher et al. [224]. The source 
image separation model described in Jayaram and Thickstun [177] has the obvious application as 
an astronomical object deblender [i.e. 222,223,225]. To summarise, SBGMs are ripe for exploitation 
by the astronomical community and we hope to see much interest in this area in the coming years. 


8(b) Self-supervised astronomical representation learning 


In 1993, Serra-Ricart et al. [198] proposed using an autoencoder to learn embeddings for stars 
as observed by the Two Micron Galactic Survey [226]. They first proved that their autoencoder 
model worked better than principal component analysis (PCA) on the toy problem of separating 
Gaussian distributions, and they then showed that their model also outperformed the classic PCA 
method on real data. More than twenty years later, Graff et al. [227]³⁷ showed that autoencoders 
are also capable of capturing the properties of galaxies as described in the Mapping Dark 
Matter Challenge [228] by demonstrating that embeddings extracted from their autoencoder 
were beneficial for computing the ellipticities of their galaxies as a downstream task. We are 
not limited to imagery; Yang and Li [229] show that an autoencoder can learn representations 
that can then be used to train a neural network for the downstream task of estimating stars’ 
atmospheric parameters, and Tsang and Schultz [230] demonstrate that an autoencoder can 
generate embeddings that can then be used to classify variable star light curves. From these 
studies we must conclude that neural networks trained via a surrogate task are capable of learning 
semantically meaningful embeddings across astronomical domains. 

Very recently there has been work applying self-supervised contrastive learning models 
to galaxy image clustering. Hayat et al. [231] trained SimCLR [185] on multi-band galaxy 
photometry from the SDSS [74]. They show that the resulting embeddings capture useful 
information by directly using them in a training set for a galaxy morphology classification model, 
and a redshift estimation model. Similarly, Sarmiento et al. [232] trained SimCLR on integral 
field spectroscopy data captured from galaxies in the Mapping Nearby Galaxies at Apache 
Point Observatory survey [MaNGA; 233]. Again, they find that SimCLR produces semantically 
meaningful embeddings. Slijepcevic et al. [234] demonstrate that the ‘Bootstrap Your Own 
Latent’ [BYOL; 187] contrastive learning model is capable of learning semantically meaningful 
representations of radio galaxies. Their model is trained on 100000 Radio Galaxy Zoo galaxies, 
and inference is run on the 1256 galaxy strong Mirabest dataset [235]. They find that embeddings 
derived from their model are semantically meaningful, suggesting that self-supervised methods 
are transferable between disparate surveys. These studies show that contrastive learning is 
applicable to imagery; further study will be required to demonstrate its effectiveness with other 
types of astronomical data, such as time series and volumetric data. 


37 See Footnote 43 for commentary on this study in the context of astronomical foundation models. 
38 A contrastive learning framework that, unlike SimCLR, does not use negative samples to learn an embedding space. 

9. Foundation models: a fourth astroconnectionist wave? 


This review has shown thus far that deep learning has found wide use in astronomy, a use 
predicated on the availability of enormous amounts of computational power and data. This 
section looks to the future and predicts an outcome if astronomy continues to follow in the 
footsteps of other applied deep learning fields. In short, we predict and argue that astronomical 
connectionism will likely see the removal of expertly crafted deep learning models, to be replaced 
with an all-encompassing ‘foundation’ model. In §9(a) we explore what foundation models are, 
and their context within deep learning. §9(b) then contextualises these models within astronomy, 
and suggests actions we can take as a community to realise an astronomical foundation model. 
Finally, §9(c) demonstrates as a thought experiment a state-of-the-art use-case for an astronomical 
foundation model. 


9(a) Foundation models 


Since its inception, connectionism has followed a path of greater compute and greater generality 
[84,85]. In that time, human crafted biases have fallen by the wayside, to be replaced with models 
and techniques that learn directly from data. Sutton [84] exemplifies this process via the field of 
speech recognition: 


In speech recognition, there was an early competition, sponsored by DARPA [Defense Advanced 
Research Projects Agency], in the 1970s. Entrants included a host of special methods that took 
advantage of human knowledge—knowledge of words, of phonemes, of the human vocal tract, etc. 
On the other side were newer methods that were more statistical in nature and did much more 
computation, based on hidden Markov models (HMMs). Again, the statistical methods won out 
over the human-knowledge-based methods. This led to a major change in all of natural language 
processing, gradually over decades, where statistics and computation came to dominate the field. The 
recent rise of deep learning in speech recognition is the most recent step in this consistent direction. 
Deep learning methods rely even less on human knowledge, and use even more computation, together 
with learning on huge training sets, to produce dramatically better speech recognition systems. As in 
[computer Go and computer chess], researchers always tried to make systems that worked the way the 
researchers thought their own minds worked—they tried to put that knowledge in their systems— 
but it proved ultimately counterproductive, and a colossal waste of researcher's time, when, through 
Moore's law, massive computation became available and a means was found to put it to good use. 


We are seeing this principle play out once again through a new paradigm shift in deep learning, 
where even the underlying neural network architecture does not matter. Previously, neural 
networks were adapted for a specific domain via inductive biases injected by researchers, such 
as convolutions for computer vision, and recurrence for language processing. Now we are seeing 
transformer networks [see §4(d) and 110] competing³⁹ in all deep learning domains, applied or 
otherwise: from language processing [12,116]⁴⁰ to computer vision [13,153] to graph learning 
[238] to protein folding [11] to astronomy [154,155,157]. The transformer’s versatility allows us 
to take a model trained on one task and apply it to a similar yet different task, a process known 
as transfer learning. For example, we could train a model on the ‘surrogate’ task of predicting the 
next word in a sequence, and then apply that model to a similar yet different task of predicting 
the answer to a geography question. In this example the first model is known as a ‘foundation’ 
model, and the downstream model is derived from it. This setup brings with it some useful 
advantages. For example, if the foundation model is improved, all downstream tasks also see 
improvement. Therefore, the need for only one model allows researchers to pool their efforts in a 
way not possible when resources are split between many projects. 

39 For now! It may be that network architecture does not matter all that much at scale, and that any sufficiently large neural 
network is adequate. If this is true, we will see the simplest (and most scalable) architectures win out. Although this theory has 
not yet been rigorously tested, we are currently seeing rumblings that suggest that this is the case [e.g. the section ‘Transformers 
are not special’ in 236]. Bo [237] stands as a particularly notable example of this hypothesis, showing that an attention-free RNN 
is capable of matching the performance of a similarly-scaled transformer network. Also see Footnote 13 for commentary on 
the performance capabilities of MLPs and transformers. 

40 These models are collectively known in the literature as large language models, or LLMs. 

To train a foundation model, we first need to define a surrogate task. As labelled datasets 
are expensive, and raw data is relatively cheap, the easiest and most scalable way to do this is 
via self-supervised learning⁴¹. Self-supervised learning does not require a human to provide a 
labelled dataset for training. Instead, the supervisory signal is generated automatically from the 
raw data. For example, in the context of astronomy this task could be predicting a masked value in 
a variable star’s light curve [154]. Another task could be using an autoencoder (§6(a)) to replicate 
a galaxy observation [200]. A further task could be training within a self-supervised framework, 
like contrastive learning (§7(a)). The important thing about self-supervised learning is that it does 
not require annotated data. This means that we can leverage vast reserves of raw data (such as 
textbooks, scraped Internet text, raw imagery, etc.). 
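
As a minimal sketch of how a supervisory signal can be generated from raw data alone, the snippet below hides a random subset of points in a toy light curve and returns the corrupted input together with the values a model would be asked to predict. The helper name `make_masked_example`, the 15 per cent mask fraction, and the zero fill value are illustrative assumptions, and are not the scheme used in [154].

```python
import math
import random


def make_masked_example(flux, mask_fraction=0.15, seed=None):
    """Build one self-supervised training pair from an unlabelled light curve.

    Hypothetical helper: hides a random subset of flux values and returns
    the corrupted input, the hidden indices, and the values to be predicted.
    """
    rng = random.Random(seed)
    n_masked = max(1, round(mask_fraction * len(flux)))
    masked_idx = sorted(rng.sample(range(len(flux)), n_masked))
    hidden = set(masked_idx)
    # Assumed placeholder: hidden points are replaced with 0.0 in the input.
    corrupted = [0.0 if i in hidden else f for i, f in enumerate(flux)]
    targets = [flux[i] for i in masked_idx]
    return corrupted, masked_idx, targets


# Toy sinusoidal 'variable star'; no human-provided labels are involved.
times = [10.0 * i / 199 for i in range(200)]
flux = [1.0 + 0.3 * math.sin(2.0 * math.pi * t / 2.5) for t in times]
corrupted, masked_idx, targets = make_masked_example(flux, seed=0)
```

A model trained on such pairs would take `corrupted` as input and be scored only on how well it recovers `targets` at `masked_idx`, so the raw observations act as their own labels.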


[Figure 24 image: example input prompts shown alongside Flamingo's completions.] 


Figure 24: Flamingo is a foundation model that is capable of understanding images within the 
context of natural language. Here we see some examples of Flamingo's emergent abilities. This 
figure is adapted from Fig. 1 in Alayrac et al. [239]. 


Very large models trained on vast amounts of data demonstrate surprising emergent 
behaviour. For instance, GPT-3 [12] is a 175B parameter model that can be ‘prompted’ to perform 
a novel task (see Fig. 24 for more on prompting foundation models). This ability was not shown at 
all in GPT-3’s older, smaller 1.5B parameter sibling [115]. Furthermore, a meta-study described in 
Wei et al. [240] found that larger models suddenly ‘unlock’ abilities such as arithmetic, translation, 
and understanding of figures of speech once they reach a certain scale. These findings suggest that 
architectural changes are not required beyond scaling to perform many tasks in natural language 
processing [85,241]. In Fig. 24 we see some results from Alayrac et al. [239], a model comprising 
an LLM and an image encoder. In this figure we can see that the model is capable of arithmetic, 
reading, counting, and has a broad knowledge (albeit not ‘understanding’) of art, geography, 
zoology⁴², and literature. This model is composed of a ResNet variant [112,243] to encode 
imagery, and the Chinchilla LLM [244] to encode and generate text. Chinchilla (and therefore 
Flamingo) was trained with the surrogate task of predicting the next word in a text sequence, and 
so none of the emergent properties stated above were explicitly optimised for. 

41 For more on self-supervised learning see §7. 

In the next subsection, we will state and explain the need for an astronomical foundation 
model⁴³, not only for astronomy’s sake, but also for the sake of openness in deep learning 
research. 


9(b) Scaling laws and data moats 


Hoffmann et al. [244] suggested an update to the foundation model scaling law first proposed in 
Kaplan et al. [246]. Their scaling law equation relates the size of a neural network model and the 
training dataset size to the minimum achievable loss. Mathematically, the equation is 


\[
L_{\min}(N, D) = \underbrace{\frac{A}{N^{\alpha}}}_{\text{parameter term}} + \underbrace{\frac{B}{D^{\beta}}}_{\text{data term}} + \underbrace{E}_{\text{dataset entropy}}, \tag{9.1}
\]

where $E$ is a constant that represents the lowest possible loss, given a particular training dataset. 
$N$ is the number of trainable parameters within the neural network, and $D$ is the size of the dataset 
in tokens (see §4(d) for more about tokenisation). We can see that when we have an infinitely 
large model trained on an infinitely large dataset (i.e. $N = D = \infty$), the only term remaining is the 
‘dataset entropy’ constant, $E$. We can therefore only reduce the loss by increasing the size of our 
model, or the size of our training set. 
After fitting Eq. 9.1, Hoffmann et al. [244] find 

\[
L_{\min}(N, D) = \frac{406.4}{N^{0.34}} + \frac{410.7}{D^{0.28}} + 1.69.
\]


If we then plug in N and D for a selection of real foundation models we arrive at Fig. 25. We can 
see in Fig. 25 that the model size term for real foundation models is far lower than the dataset size 
term. This means that an increase in dataset size has the potential to reduce the minimum loss by 
a far larger amount than a larger model would. Therefore, an obvious next step to improve these 
foundation models further is by increasing their dataset size. 
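
The comparison above can be reproduced with a few lines of arithmetic. The sketch below evaluates the two terms of Eq. 9.1 using the constants fitted by Hoffmann et al. [244] and the (N, D) pairs listed in Fig. 25; the function and variable names are our own and purely illustrative.

```python
# Fitted constants reported by Hoffmann et al. [244] for Eq. 9.1.
A, ALPHA = 406.4, 0.34
B, BETA = 410.7, 0.28
E = 1.69


def loss_terms(n_params, n_tokens):
    """Return (parameter term, data term, minimum loss) from Eq. 9.1."""
    param_term = A / n_params**ALPHA
    data_term = B / n_tokens**BETA
    return param_term, data_term, param_term + data_term + E


# (N, D) pairs as listed in Fig. 25: parameters and training tokens.
models = {
    "LaMDA": (137e9, 168e9),
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "MT-NLG": (530e9, 270e9),
    "Chinchilla": (70e9, 1.4e12),
    "PaLM": (540e9, 780e9),
}

for name, (n, d) in models.items():
    p_term, d_term, l_min = loss_terms(n, d)
    print(f"{name:<11} A/N^a = {p_term:.3f}  B/D^b = {d_term:.3f}  L_min = {l_min:.3f}")
```

The printed terms agree with those tabulated in Fig. 25 to within about 0.001, and in every case the data term exceeds the parameter term, which is the basis for the argument that more data is the more promising route to a lower loss.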

The largest dataset [MassiveText-English; 244] in the comparison shown in Fig. 25 amounts to 
1.4T tokens. However, this dataset is proprietary, being only available to researchers employed by 
Google. The largest public text dataset available at the time of writing is The Pile [250], with a total 
size of ~260B tokens. We could increase the size of these datasets by indefinitely scraping text data 
from the surface web, but this data tends to be of low quality. Also, we have already exhausted 
some important high quality data reserves, like fundamental research papers, and open source 
code [251]. We also have to ask ourselves: what happens when generative models start to create 
data en masse, and dump it indiscriminately onto the Internet? If a significant proportion of text in 
a dataset scraped from the Internet is generated via an LLM, training on it will cause unforeseen 
issues and may ultimately result in a model with worse performance. We must therefore ensure 
that the data is not generated by a deep generative model. In addition to all this, the academy and 
the public at large will never have access to the vast reserves of data contained in the deep web 
administered by ByteDance, Google, Meta, Microsoft, and other tech giants. For all these reasons, 
we will need to think outside the box if we want to mine new high quality data. 

Model              N      D      A/N^α   B/D^β   L_min 
LaMDA [247]        137B   168B   0.066   0.295   2.051 
GPT-3 [12]         175B   300B   0.061   0.251   2.002 
Gopher [248]       280B   300B   0.052   0.251   1.993 
MT-NLG [249]       530B   270B   0.041   0.259   1.990 
Chinchilla [244]   70B    1.4T   0.083   0.163   1.936 
PaLM [241]         540B   780B   0.042   0.192   1.924 

Figure 25: A comparison between the minimum losses of a selection of foundation models. The 
table above shows the number of parameters in a model (N), the number of tokens within that 
model's training set (D), and their corresponding calculated emergent terms from Eq. 9.1. Here 
we use Hoffmann et al. [244] to source values for A, α, B, and β. The minimum loss for each 
model according to Hoffmann et al. [244] is shown as L_min. The contour plot shows the emergent 
parameters B/D^β and A/N^α plotted against each other for our models. The closer the models’ 
scatterpoints are to the bottom left, the lower their minimum loss value. 

42 Interestingly, the authors of Flamingo first assumed that Flamingo’s prediction of the species range of its eponymous bird 
was incorrect: flamingos are found in the Caribbean, South America, Africa, Europe, and South Asia. However, they later 
realised that the picture in Fig. 24 is of an American flamingo, which is specifically found in the Caribbean and South America, 
so the network was right after all! See the Reddit thread for the full context [242]. 

43 Walmsley et al. [245] explore in a preliminary study a ‘galaxy foundation model’ trained on Galaxy Zoo labels, and 
corresponding paired galaxy observations. They find that their pretraining is beneficial for training a network that performs 
a downstream task. However, the idea has been around for far longer than that; possibly the first demonstration of an 
astronomical foundation model was described eight years earlier in Graff et al. [227]. Graff et al. [227] demonstrated that 
embeddings learnt with their autoencoding SkyNet network can be used for downstream tasks, but they do not use the 
moniker ‘foundation model’ to describe SkyNet as the term had not yet been invented! Notably, neither study trains a model 
of the scale required to exhibit emergent properties or task generalisability. These ‘blessings of scale’ require data and compute 
at a level that has not yet been seen within astronomical connectionism. 

Enter the multimodal foundation model. Reed et al. [117]⁴⁴ demonstrated that a large 
transformer neural network is capable of learning many tasks, from playing Atari, to captioning 
images, to chatting, to operating a real robot arm. The model shares weights across all tasks, and 
decides at inference time from context which task to predict. Importantly, Reed et al. [117] find 
that their model follows the same scaling laws as other foundation models, and so multimodal 
foundation models have the same hunger for data that we see in Fig. 25. We can therefore augment 
(or replace) our text datasets with high quality, publicly available astronomical data. 

44 Earlier work from Kaiser et al. [252] also demonstrated a deep learning model that could learn from disparate tasks; however, 
Gato is the first model that achieves this while staying within a single deep learning paradigm. 

The Vera Rubin Observatory’s 189 16-megapixel CCDs will observe 1000 science frames per 
night while conducting LSST [253]. This amounts to $3 \times 10^{12}$ pixels per night, or approximately 
12B tokens a night if we use the same tokenising scheme as Dosovitskiy et al.’s vision transformer 
[13]. After only one year of observing, the LSST will have produced 4.4T tokens of raw data, 
larger than even the MassiveText-English dataset⁴⁵. This data, and other astronomical data like it, 
could be compiled into a very large open dataset similar to EleutherAI’s Pile [250]. This dataset 
would provide a way for academics employed outside of Big Tech to train and research very 
large foundation models. Compiling a dataset like this would be difficult for a single relatively 
underresourced research group, but it could be accomplished via bazaar style open development 
[254]. We have already seen this development model succeed in large open source projects, the 
most famous of which is the Linux kernel. This development model has also been shown to 
work within the field of deep learning by EleutherAI [e.g. 250,255,256], and with HuggingFace’s 
BigScience initiative⁴⁶. Once compiled, we must ensure that progress is kept in the open, and that 
the data is not simply absorbed into proprietary datasets—to do this we must give our dataset a 
strong (viral) copyleft style licence. 
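
The LSST token estimate quoted above follows from simple arithmetic, sketched below. We assume that the vision transformer tokenising scheme means one token per 16 × 16 pixel patch, and that observations are taken on all 365 nights of the year; both are simplifying assumptions made for illustration.

```python
N_CCDS = 189
PIXELS_PER_CCD = 16e6        # 16 megapixels per CCD
FRAMES_PER_NIGHT = 1000
PATCH_SIDE = 16              # assumed: one token per 16 x 16 pixel patch
NIGHTS_PER_YEAR = 365        # assumed: observing on every night of the year

pixels_per_night = N_CCDS * PIXELS_PER_CCD * FRAMES_PER_NIGHT
tokens_per_night = pixels_per_night / PATCH_SIDE**2
tokens_per_year = tokens_per_night * NIGHTS_PER_YEAR

print(f"pixels per night: {pixels_per_night:.2e}")
print(f"tokens per night: {tokens_per_night:.2e}")
print(f"tokens per year:  {tokens_per_year:.2e}")
```

Under these assumptions the nightly count is close to 12B tokens and the yearly count is a little over 4T tokens, in line with the figures in the text (which round the nightly count up before multiplying), and roughly three times the 1.4T tokens of MassiveText-English.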

Once the dataset is compiled all we need are some self-supervised surrogate tasks for our 
‘astrofoundation’ model to attempt. These tasks could include predicting the next observation 
in a variable star’s time sequence, predicting the low surface brightness profile of a galaxy, 
predicting a galaxy’s morphological parameters, or simply generating the next crop in a sequence 
of observations⁴⁷. Our astrofoundation model will inherit all the interesting properties that LLMs 
enjoy, such as few to zero-shot generation and other emergent behaviours. We could also finetune 
the astrofoundation model for downstream tasks, saving much time and compute. In the next section we will 
outline one possible downstream task that would be useful in astronomy: a conditional generative 
model for galaxy simulation. 


9(c) A new class of simulation 


If we train an unconditional generative model, we cannot control its output at inference time. This 
is an issue if we want to generate specific classes of observations to train models for downstream 
tasks, such as redshift estimation, or galaxy type classification. To achieve a model capable of 
generating specific classes, one could simply train a conditional generative model of the form 


\[
G_{\phi}(x \mid z, y), \tag{9.2}
\]


where x is a generated image, z is some noise that acts to capture all detail not encoded in y, and 
y is a conditioning vector. As an example, y could contain a galaxy’s redshift or morphological 
type. However, this means that we must be very specific when choosing y. Multimodal modelling 
provides us the means to sidestep this fundamental issue, and lets us play with fuzzy inputs. 

As a thought experiment let us consider Google’s recent ‘Imagen’ model⁴⁸, and imagine how it 
could be repurposed for an astronomical use case [Figs. 26 and 27; 259]. Imagen is a combination 
of a frozen LLM [specifically T5-XXL; 260] and a cascaded diffusion model [261, also see §6(c)]. 
The LLM acts as a language encoder, and then passes its generated latent space representations 
onto the diffusion model as a conditioning vector. If we were to replace the frozen LLM with 
an ‘astrofoundation’ model (see §9(a) and §9(b)), we could leverage astronomy’s fundamentally 
multimodal nature. For example, if our astrofoundation model were trained to understand the 
Galaxy Zoo 2 (GZ2) morphological classifications [262], we could take the GZ2 descriptors as y 
and their corresponding paired galaxy images as x and train on those. 

Figure 26: Select 1024 × 1024 Imagen samples generated from text inputs. Below each image is its 
corresponding conditioning text. Figure adapted from Fig. A.2 in [259]. 

45 Of course, the reduced, useful data will be far smaller than our raw estimate here. The motivation behind this calculation is 
to show that even a single astronomical survey rivals the largest text dataset in size. A compilation of all useful astronomical 
data would certainly dwarf any contemporary text dataset, whether public or proprietary. 

46 https://bigscience.huggingface.co 

47 This is essentially training the model to act as a physics simulator. Viewing foundation models as world simulators is not 
unprecedented. This perspective has already been explored in the simulation of thousands of ‘social simulacra’ within a model 
online community [257], and with the simulation of participants in classic (e.g. Milgram’s shock experiment, the Ultimatum 
Game) and novel psychological studies [258]. 

48 Naturally, no implementation is provided by Google. However, there is already a fantastic MIT licenced implementation of 
Imagen provided by Phil Wang and others (https://github.com/lucidrains/imagen-pytorch). All we are missing 
now is the data and the means to train with it! 

Once trained, our astronomical Imagen model could generate synthetic galaxies that 
resemble the real galaxy observations that it was trained on. However, unlike an unconditional 
astronomical simulator, this model would be capable of generating galaxies that specifically 
resemble a real galaxy that shares the conditioning set of GZ2 parameters! 

Unlike the conditional model described by Eq. 9.2, an astrofoundation type model allows us 
to be creative with the conditioning vector. For example, we could run the model in reverse to 
generate representations that refer to a very specific astronomical object, and then generate many 
more objects of that ‘class’ with injected features like satellite occlusion, a specific instrument 
response function, a specific redshift, etc. [see work on ‘textual inversion’ by 263]. We could even 
create a ‘Galaxy Zoo’ type dataset that asks citizen scientists to describe galaxy morphology via 
natural language. This is possible since the encoding foundation model does not fundamentally 
care about which form the caption takes. This approach would cut down on citizen scientist 
training cost due to natural language’s inherent intuitiveness. 

Figure 27: An Imagen-like model uses a frozen foundation model to encode text, and then uses 
that encoding to condition a cascaded diffusion model of the form $G_{\phi}(x \mid z, \tilde{y})$ [259,261]. Here we 
see one possible realisation of this type of model in astronomy. $y$ is some kind of descriptive vector 
that can be paired with a ground truth image. For example, $y$ could be the surface brightness 
profile of a galaxy, or the summary statistics of a variable star light curve, or some cosmological 
parameters. In general, $y$ could be any vector that the astrofoundation model understands. $\tilde{y}$ is 
$y$’s projected latent space equivalent. Since we do not need to train the foundation model here, 
training cost is far lower than for an equivalent end-to-end trained model. 


10. A word about the carbon footprint of deep learning 


The training of deep learning models in general requires a considerable amount of energy, and it 
is only natural that the training of ultra-large foundation models significantly ups the ante. In this 
section we illustrate connectionism’s hunger for energy by estimating the total carbon footprint 
created in the training of the GPT-3 and PaLM foundation models [12,241]. 

Let us start with the eminent GPT-3 model. Unfortunately, the total energy cost is not stated 
in Brown et al. [12] but we can make a ballpark estimate using information from that work. GPT-3 
was trained on a high performance computing cluster containing $N = 10\,000$ NVIDIA V100 
chips, and required a total $X = 3.14 \times 10^{23}$ FLOPs to train to completion [12]. A single V100 has 
a throughput of $C = 2.8 \times 10^{13}$ FLOPS for half precision floats and so we can estimate GPT-3’s 
total training time in datacentre-seconds as 

\[
\frac{X}{C \cdot N} = \frac{3.14 \times 10^{23}}{2.8 \times 10^{13} \cdot 10^{4}} = 1.12 \times 10^{6}\ \mathrm{s},
\]


which is approximately 311 hours. We know the thermal design power of a single V100 chip is 
300 W and so we can safely assume a lower bound on the datacentre power usage as 3000 kW. 
Therefore, we estimate the total energy consumed while training GPT-3 as 

\[
3000\ \mathrm{kW} \times 311\ \mathrm{h} = 933\,000\ \mathrm{kWh}.
\]

The emissions per kWh of the datacentre where GPT-3 was trained is 0.429 kg CO2e kWh$^{-1}$ [264], 
leaving us with a total emission of around 400 000 kg CO2e⁴⁹. 
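
The GPT-3 estimate can be checked with the short calculation below, which uses only the figures quoted in this section. The variable names are illustrative, and as in the text the result is a lower bound because it counts GPU thermal design power only.

```python
N_GPUS = 10_000
TOTAL_FLOPS = 3.14e23        # total training compute, in FLOPs
GPU_THROUGHPUT = 2.8e13      # FLOPS per V100 at half precision
GPU_TDP_KW = 0.300           # thermal design power per V100, in kW
CARBON_INTENSITY = 0.429     # kg CO2e emitted per kWh

training_seconds = TOTAL_FLOPS / (GPU_THROUGHPUT * N_GPUS)
training_hours = training_seconds / 3600
energy_kwh = N_GPUS * GPU_TDP_KW * training_hours
emissions_kg = energy_kwh * CARBON_INTENSITY

print(f"training time: {training_hours:.1f} h")
print(f"energy used:   {energy_kwh:,.0f} kWh")
print(f"emissions:     {emissions_kg:,.0f} kg CO2e")
```

Keeping the unrounded training time shifts the energy and emission values by well under one per cent relative to the rounded figures in the text.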

However, GPT-3 is already two years old; so we will also estimate the energy used when 
training Google’s state-of-the-art ‘PaLM’ foundation model. Chowdhery et al. [241] state: ‘We 
trained PaLM-540B on 6144 TPU v4 chips for 1200 hours and 3072 TPU v4 chips for 336 hours including 
some downtime and repeated steps...[We found a] 378.5 W measured system power per TPU v4 chip...’ 
We can therefore calculate PaLM’s total energy usage as 


\[
378.5\ \mathrm{W} \times (6144 \times 1200\ \mathrm{h} + 3072 \times 336\ \mathrm{h}) \approx 3\,180\,000\ \mathrm{kWh}.
\]


If PaLM had been trained in the same datacentre as GPT-3 (i.e. at a carbon intensity of 
0.429 kg CO2e kWh$^{-1}$) it would have emitted a staggering 1 400 000 kg CO2e, quadruple the 
average person’s lifetime carbon footprint [265] and approaching the annual emission of some 
small countries. Luckily, the datacentre that PaLM was trained on was far greener than that used 
by OpenAI, and PaLM actually produced ~270 000 kg CO2e [241], although this is still rather 
large. We contextualise our calculated footprints visually in Fig. 28. 
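
The PaLM figures can be verified in the same way. The sketch below uses the chip counts, durations, and measured per-chip power quoted from Chowdhery et al. [241], together with the 0.429 kg CO2e per kWh intensity used for the GPT-3 estimate; the variable names are our own.

```python
POWER_PER_CHIP_W = 378.5     # measured system power per TPU v4 chip
CARBON_INTENSITY = 0.429     # kg CO2e per kWh, as for the GPT-3 estimate
REPORTED_EMISSIONS_KG = 270_000  # value reported for PaLM in [241]

# Two training phases: 6144 chips for 1200 h, then 3072 chips for 336 h.
chip_hours = 6144 * 1200 + 3072 * 336
energy_kwh = POWER_PER_CHIP_W * chip_hours / 1000
emissions_same_grid_kg = energy_kwh * CARBON_INTENSITY
ratio = emissions_same_grid_kg / REPORTED_EMISSIONS_KG

print(f"chip-hours:          {chip_hours:,}")
print(f"energy used:         {energy_kwh:,.0f} kWh")
print(f"emissions, same grid: {emissions_same_grid_kg:,.0f} kg CO2e")
print(f"ratio to reported:    {ratio:.1f}")
```

The ratio in the last line shows that the choice of datacentre alone changes the footprint of the same training run by roughly a factor of five.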
Figure 28: Here we contextualise the huge carbon footprints generated when training foundation 
models. The average person’s yearly carbon footprint is estimated as 4750 kg CO2e using data 
from Friedlingstein et al. [265], and the car lifetime emissions are 38 504 kg CO2e assuming a 
Mercedes-Benz C 300 d model [266]. 


PaLM’s contribution to Fig. 28 demonstrates the importance of choosing and using datacentres 
that run on clean energy sources when training deep learning models, and that make efficient use 
of heat output (e.g. through recovery systems). Besides this, researchers can also take care when 
optimising their neural network models to reduce their carbon footprint, for instance by choosing 
hyperparameters through a more efficient manual or randomised search, instead of via a brute 
force method [267]. As stated in Strubell et al. [268], researchers can also combat redundant 
retraining of models (and thus unnecessary energy usage) by ensuring that fully trained 
models, data, and code are released under an open licence. The publishing of a fully trained 
model’s energy usage, computation requirements, and carbon footprint also allows downstream 
researchers to determine whether replication of a work is economically and environmentally 
viable. Calculating one’s energy usage in the spirit of openness does not have to be difficult: we 
have been using the excellent and user-friendly ‘Machine Learning CO2 Impact Calculator’ in our 
own work to calculate and publish the carbon footprint of our models [269]. A recommendation 
of this review is that an environmental impact statement should become standard practice in 
journal articles, conference presentations and proceedings when deep learning models (or any 
HPC-heavy research for that matter) are used. 

49 We must keep in mind that this estimate is a lower limit. We do not include CPU power, cooling, or any other overheads in 
our calculation, never mind the cost to do a full hyperparameter sweep! 
Figure 29: Here we see the number of arXiv:astro-ph submissions whose titles or abstracts match 
the terms given in the legend. We can see three distinct ‘waves’. The first corresponds to studies 
that use MLPs (§2(a)-§3), the second corresponds to studies that use ‘deep learning’ methods 
that ingest raw data (§4(a)-§5) and the third corresponds to studies that use generative or 
self-supervised models (§6-§8). The raw data is in the public domain, and is available at 
https://www.kaggle.com/Cornell-University/arxiv. 


11. Final comments, or how we learnt to stop worrying and love 
astronomy’s Big Data Era 


To repeat our introductory statement: in every field that deep learning has infiltrated we have 
seen a reduction in the use of specialist knowledge, to be replaced with knowledge automatically 
derived from data. We have already seen this process play out in many disparate fields from 
computer Go [10], to protein folding [11], to natural language processing [12], to computer vision 
[13]. This process is already well known within the deep learning community as ‘The Bitter Lesson,’ 
a precept that is summarised by the quote: 


The biggest lesson that can be read from 70 years of AI research is that general methods that leverage 
computation are ultimately the most effective, and by a large margin. [84] 


There is no reason to believe that astronomy is fundamentally different. Indeed, within this 
review we have seen a narrative pointing to this conclusion (Fig. 29). Initial work on MLPs within 
astronomy required manually selected emergent properties as input [e.g. 44,66]. With the advent 
of CNNs and RNNs, these manually selected inputs gave way to raw data ingestion [e.g. 124,139]. 
Now we are seeing the removal of human supervision altogether with deep learning methods 
inferring labels and knowledge directly from the data [e.g. 155,200]. Ultimately, if astronomy 
follows in the footsteps of other applied deep learning fields, we will see the removal of expertly 
crafted deep learning models, to be replaced with finetuned versions of an all-encompassing 
‘foundation’ model [158]. This process is by no means a bad thing; the removal of human bias in 
the astronomical discovery process allows us to find ‘unknown unknowns’ through serendipity 
[154,232]. Likewise, the ability to leverage data allows us to directly generate and interrogate 
realistic yet synthetic observations, sidestepping the need for an expensive and fragile classical 
simulation [184,213]. 

Astronomy’s relative data wealth gives us the opportunity to form a symbiotic relationship 
with the cutting edge of deep learning research, an increasingly data hungry field [85,251]. Many 
ultra-large datasets in machine learning are proprietary, and so the astronomical community has 
the opportunity to step in and provide a high quality multimodal public dataset. In turn, this 
dataset could be used to train an astronomical ‘foundation’ model that can be used for state-of- 
the-art downstream tasks (such as astronomical simulation, see §9(c)). Finally, following recent 
developments in connectionism [12,244] most astronomers lack the resources to train models on 
the cutting edge of the field. If astronomy is to have any chance of keeping up with the Big Tech 
goliaths, we must follow the examples of EleutherAI and HuggingFace and pool our resources in 
a grassroots-style open source fashion (§9(b)). We leave this as a challenge for the community. 